To my parents, Francisco de Oliveira Couto and
Maria Fernanda dos Santos Moreira Couto.


During the last decades, I witnessed the growing importance of computer 
science skills for career advancement in Health and Life Sciences. However, 
not everyone has the skill, inclination, or time to learn computer
programming. The learning process is usually time-consuming and requires
constant practice, since software frameworks and programming languages
change substantially over time. This is the main motivation for writing this
book about using shell scripting to address common biomedical data and text
processing tasks. Shell scripting has the advantages of being: (i) available
nowadays on almost all personal computers; (ii) almost immutable for more than four
decades; (iii) relatively easy to learn as a sequence of independent commands; 
(iv) an incremental and direct way to solve many of the data problems that 
Health and Life professionals face. 

During the last decades, I had the pleasure of teaching introductory computer
science classes to Health and Life Sciences undergraduates. I used
programming languages, such as Perl and Python, to address data and text
processing tasks, but I always felt that I lost a substantial amount of time
teaching the technicalities of these languages, which will probably change
over time and are uninteresting for the majority of the students, who do not
intend to pursue advanced bioinformatics courses. Thus, the purpose of this
book is to motivate and help specialists to automate common data and text 
processing tasks after a short learning period. If they become interested (and 
I hope some do), the book presents pointers to where they can acquire more 
advanced computer science skills. 

This book does not intend to be a comprehensive compendium of shell 
scripting commands but instead an introductory guide for Health and Life 
specialists. This book introduces the commands as they are required to 
automate data and text processing tasks. The selected tasks have a strong 
focus on text mining and biomedical ontologies given my research experience 
and their growing relevance for Health and Life studies. Nevertheless, the
same types of solutions presented in the book are also applicable to many
other research fields and data sources.


Lisboa, Portugal Francisco M. Couto 
January 2019 
Abstract

Health and Life studies are well known 
for the huge amount of data they produce, 
such as high-throughput sequencing projects 
(Stephens et al., PLoS Biol 13(7):e1002195, 
2015; Hey et al., The fourth paradigm: 
data-intensive scientific discovery, vol 1. 
Microsoft Research, Redmond,
2009). However, the value of the data should 
not be measured by its amount, but instead 
by the possibility and ability of researchers to 
retrieve and process it (Leonelli, Data-centric 
biology: a philosophical study. University of 
Chicago Press, Chicago, 2016). Transparency, 
openness, and reproducibility are key aspects 
to boost the discovery of novel insights 
into how living systems work (Nosek et al., 
Science 348(6242): 1422-1425, 2015). 
Biomedical Data Repositories


Fortunately, a significant portion of the 
biomedical data is already being collected, 
integrated and distributed through Biomed- 
ical Data Repositories, such as European 
Bioinformatics Institute (EBI) and National 
Center for Biotechnology Information (NCBI) 
repositories (Cook et al. 2017; Coordinators 
2018). Nonetheless, researchers cannot treat
available data as mere facts: the data may contain
errors, can be outdated, and may require a
context (Ferreira et al. 2017). Most facts are only
valid in a specific biological setting and should 
not be directly extrapolated to other cases. In 
addition, different research communities have 
different needs and requirements, which change 
over time (Tomczak et al. 2018). 


Scientific Text 


Structured data is what most computer applica- 
tions require as input, but humans tend to prefer 
the flexibility of text to express their hypotheses,
ideas, opinions, and conclusions (Barros and
Couto 2016). This explains why scientific text
is still the preferred means to publish new
discoveries and to describe the data that support
them (Holzinger et al. 2014; Lu 2011). Another 
reason is the long-established scientific reward 
system based on the publication of scientific 
articles (Rawat and Meena 2014). 


Amount of Text 


The main problem of analyzing biomedical text
is the huge amount of text being published every
day (Hersh 2008). For example, 813,598 citations¹
were added in 2017 to MEDLINE, a bibliographic
database of Health and Life literature².
If we read 10 articles per day, it would take us
more than 222 years just to read those articles.
Figure 1.1 presents the number of citations added
to MEDLINE in the past decades, showing the
increasingly large amount of biomedical text that
researchers must deal with.
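
The reading-time estimate above can be checked with shell arithmetic. This is only a small sketch: the variable names are made up for illustration, and a year is assumed to have 365 days.

```shell
# Figures quoted in the text: citations added to MEDLINE in 2017
# and a reading pace of 10 articles per day.
citations=813598
per_day=10

# Shell arithmetic is integer-only, so each division rounds down.
days=$(( citations / per_day ))
years=$(( days / 365 ))

echo "Days needed:  $days"
echo "Years needed: $years"
```

Because the divisions round down, the result is a lower bound, which agrees with the "more than 222 years" stated in the text.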

Moreover, scientific articles are not the only
source of biomedical text; for example, clinical
studies and patents also provide a large amount
of text to explore. They are also growing at a fast 
pace, as Figs. 1.2 and 1.3 clearly show (Aras et al. 
2014; Jensen et al. 2012). 


Ambiguity and Contextualization 


Given the high flexibility and ambiguity of natural
language, processing and extracting information
from texts is a painful and hard task, even
for humans. The problem is even more complex
when dealing with scientific text, which requires
specialized expertise to be understood. The major
problem with Health and Life Sciences is the in- 
consistency of the nomenclature used for describ- 
ing biomedical concepts and entities (Hunter and 
Cohen 2006; Rebholz-Schuhmann et al. 2005). In 
biomedical text, we can often find different terms 
referring to the same biological concept or entity 
(synonyms), or the same term meaning different
biological concepts or entities (homonyms). For
example, authors often improve the readability
of their publications by using acronyms to
mention entities, which may be clear to experts in
the field but ambiguous in another context.


¹ https://www.nlm.nih.gov/bsd/index_stats_comp.html
² https://www.nlm.nih.gov/bsd/medline.html

The second problem is the complexity of the 
message. Almost everyone can read and under- 
stand a newspaper story, but just a few can really 
understand a scientific article. Understanding the 
underlying message in such articles normally 
requires years of training to create in our brain 
a semantic model about the domain and to know 
how to interpret the highly specialized terminol- 
ogy specific to each domain. Finally, the mul- 
tilingual aspect of text is also a problem, since 
most clinical data are produced in the native 
language (Campos et al. 2017). 


Biomedical Ontologies 


To address the issue of ambiguity of natural 
language and contextualization of the message, 
text processing techniques can explore current 
biomedical ontologies (Robinson and Bauer 
2011). These ontologies can work as vocabularies 
to guide us in what to look for (Couto et al. 
2006). For example, we can select an ontology 
that models a given domain and find out which 
official names and synonyms are used to mention 
concepts in which we have an interest (Spasic 
et al. 2005). Ontologies may also be explored 
as semantic models by providing semantic 
relationships between concepts (Lamurias et al. 
2017). 


Programming Skills 


The success of biomedical studies relies on overcoming
data and text processing issues to make the
most of all the information available in biomedical
data repositories. In most cases, biomedical
data analysis is no longer possible using an in-house
and limited dataset; we must be able to
efficiently process all this data and text. So, a
common question that many Health and Life 
specialists face is: 
Fig. 1.1 Chronological listing of the total number of citations in MEDLINE (Source: https://www.nlm.nih.gov/bsd/)

Fig. 1.3 Chronological listing of the total number of patents in force (Source: WIPO statistics database http://www.wipo.int/ipstats/en/)


How can I deal with such a huge amount of
data and text without the necessary expertise, time
and disposition to learn computer programming?


This is the goal of this book: to provide a low-cost,
long-lasting, feasible and painless answer to
this question.


Why This Book? 


State-of-the-art data and text processing tools
are nowadays based on complex and sophisticated
technologies, and to understand them we
need specialized knowledge of programming,
linguistics, machine learning or deep learning
(Holzinger and Jurisica 2014; Ching et al.
2018; Angermueller et al. 2016). Explaining their
technicalities or providing a comprehensive list
of them is not the purpose of this book. The
tools implementing these technologies tend to
be impenetrable to the common Health and Life
specialist and usually become outdated or even
unavailable some time after their publication or
after the financial support ends. Instead, this book will
equip the reader with a set of skills to process text
with minimal dependencies on existing tools and
technologies. The idea is not to explain how to
build the most advanced tool, but how to create
a resilient and versatile solution with acceptable
results.

In many cases, advanced tools may not be
the most efficient approach to tackle a specific problem.
It all depends on the complexity of the problem
and the results we need to obtain. Just as a good
physician knows that the most efficient treatment
for a specific patient is not always the most
advanced one, a good data scientist knows that
the most efficient tool to address a specific information
need is not always the most advanced one.
Even without focusing on the foundational basis 
of programming, linguistics or artificial intelli- 
gence, this book provides the basic knowledge 
and right references to pursue a more advanced 
solution if required. 


Third-Party Solutions 


Many manuscripts already present and discuss 
the most recent and efficient text mining 
techniques and the available software solutions 
based on them that users can use to process data 
and text (Cock et al. 2009; Gentleman et al. 2004; 
Stajich et al. 2002). These solutions include 
stand-alone applications, web applications, 
frameworks, packages, pipelines, etc. A common
problem with these solutions is their limited resilience
to new user requirements, to changes
in how resources are distributed, and to
software and hardware updates. Commercial
solutions tend to be more resilient if they have
enough customers to support the adaptation
process. But of course we need the funding
to buy the service. Moreover, we will still be
dependent on a third party's availability to address
our continuously changing requirements, and that
availability varies according to the size of the company
and our relevance as a client.

Using open-source solutions may seem a great
alternative since we do not need to allocate funding
to use the service and its maintenance is assured
by the community. However, many of these
solutions derive from academic projects that, most
of the time, are highly active during the funding
period and then fade away to minimal updates.
The focus of academic research is on creating
new and more efficient methods and publishing
them; the software is normally just a means to
demonstrate their breakthroughs. In many cases,
executing the legacy software is already a non-trivial
task, and implementing the required changes
is even harder. Thus, frequently the most
feasible solution is to start from scratch.


Simple Pipelines 


If we are interested in learning sophisticated and 
advanced programming skills, this is not the right 
book to read. This book aims at helping Health 
and Life specialists to process data and text by 
describing a simple pipeline that can be executed 
with minimal software dependencies. Instead of
using a fancy web front-end, we can still manually
manipulate our data using the spreadsheet
application that we are already comfortable with,
and at the same time be able to automate some
of the repetitive tasks.


In summary, this book is directed mainly 
towards Health and Life specialists and 
students that need to know how to process 
biomedical data and text, without being 
dependent on continuous financial support, 
third-party applications, or advanced com- 
puter skills. 


How This Book Helps Health and 
Life Specialists? 


So, if this book focuses neither on learning programming
skills nor on the usage of any
special package or software, how will it help
specialists process biomedical text and data?


Shell Scripting 


The solution proposed in this book has been 
available for more than four decades (Ritchie 
1971), and it can now be used in almost every 
personal computer (Haines 2017). The idea is to
provide an example-driven introduction to shell
scripting³ that addresses common challenges in
biomedical text processing using a Unix shell⁴.
Shells are software programs available in Unix
operating systems since 1971⁵, and nowadays they are
available on most of our personal computers using
Linux, macOS or Windows operating systems.


But a shell script is still a computer algo- 
rithm, so how is it different from learning 
another programming language? 


³ https://en.wikipedia.org/wiki/Shell_script
⁴ https://en.wikipedia.org/wiki/Unix_shell
⁵ https://www.in-ulm.de/~mascheck/bourne/#origins


It is different in the sense that most solutions
are based on the usage of single command line
tools, which are sometimes combined as simple
pipelines. This book does not intend to create
experts in shell scripting; on the contrary, the few
scripts introduced are merely direct combinations
of simple command line tools that were individually
explained beforehand.

The main idea is to demonstrate the ability of 
a few command line tools to automate many of 
the text and data processing tasks. The solutions 
are presented in a way that comprehending them
is like conducting a new laboratory protocol, i.e.,
testing and understanding its multiple procedural
steps, variables, and intermediate results.


Text Files 


All the data will be stored in text files, which 
command line tools are able to efficiently pro- 
cess (Baker and Milligan 2014). Text files repre- 
sent a simple and universal medium of storing our 
data. They do not require any special encoding 
and can be opened and interpreted by using 
any text editor application. Normally, text files 
without any kind of formatting are stored using 
a txt extension. However, text files can contain 
data using a specific format, such as: 


CSV : Comma-Separated Values⁶;
TSV : Tab-Separated Values⁷;
XML : eXtensible Markup Language⁸.


All the above formats can be opened (imported),
edited and saved (exported) by any text editor application
and by common spreadsheet applications⁹,
such as LibreOffice Calc or Microsoft Excel¹⁰.
For example, we can create a new data file using
LibreOffice Calc, like the one in Fig. 1.4. Then
we select the option to save it as CSV, TSV, XML

⁶ https://en.wikipedia.org/wiki/Comma-separated_values
⁷ https://en.wikipedia.org/wiki/Tab-separated_values
⁸ https://en.wikipedia.org/wiki/XML
⁹ https://en.wikipedia.org/wiki/Spreadsheet


¹⁰ To save in TSV format using LibreOffice Calc, we
may have to choose the CSV format and then select the tab
character as the field delimiter.


1 Introduction 


Fig. 1.4 Spreadsheet example 


(Microsoft 2003), and XLS (Microsoft 2003) 
formats. We can try to open all these files in our 
favorite text editor. 

When opening the CSV file, the application 
will show the following contents: 


A,C 
G,T 


Each line represents a row of the spreadsheet, and 
column values are separated by commas. 

When opening the TSV file, the application 
will show the following contents: 


A	C
G	T


The only difference is that a tab character is now
used instead of a comma to separate column
values.
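
The relation between the two formats can be sketched with two standard command line tools. The file names below are made up for illustration, and the conversion is naive: it assumes that no value contains a comma inside quotes.

```shell
# Recreate the two-row spreadsheet from the text as a CSV file.
printf 'A,C\nG,T\n' > example.csv

# Replace every comma by a tab character to obtain the TSV version.
tr ',' '\t' < example.csv > example.tsv

# Show both files; the tab in the TSV output is rendered as whitespace.
cat example.csv
cat example.tsv
```

Real spreadsheets exported as CSV may quote values that contain commas, so a dedicated export option of the spreadsheet application is safer for such data.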

When opening the XML file, the application 
will show the following contents: 


<Table ss:StyleID="ta1">
<Column ss:Span="1" ss:Width="64.01"/>
<Row ss:Height="12.81"><Cell><Data ss:Type="String">A</Data></Cell><Cell><Data ss:Type="String">C</Data></Cell></Row>
<Row ss:Height="12.81"><Cell><Data ss:Type="String">G</Data></Cell><Cell><Data ss:Type="String">T</Data></Cell></Row>
</Table>


Now the data is more complex to find and under- 
stand, but with a little more effort we can check 
that we have a table with two rows, each one with 
two cells. 
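
The "two rows, each one with two cells" check can also be done on the command line instead of by eye. The following is a rough sketch that treats the XML as plain text, which only works for a small and regular fragment like this one; the file name is made up, and the fragment is a simplified copy of the one shown above.

```shell
# Save a simplified copy of the XML fragment (illustrative file name).
cat > example.xml <<'EOF'
<Table>
<Row><Cell><Data ss:Type="String">A</Data></Cell><Cell><Data ss:Type="String">C</Data></Cell></Row>
<Row><Cell><Data ss:Type="String">G</Data></Cell><Cell><Data ss:Type="String">T</Data></Cell></Row>
</Table>
EOF

# Count the opening Row and Cell tags (-o prints each match on its own line).
rows=$(grep -o '<Row' example.xml | wc -l)
cells=$(grep -o '<Cell>' example.xml | wc -l)

echo "rows: $rows"
echo "cells: $cells"
```

For real XML files a proper XML tool, such as the xmllint tool that the book introduces later, is a more robust choice than text matching.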

When opening the XLS file, we will get a 
lot of strange characters and it is humanly im- 
possible to understand what data it is storing. 




This happens because XLS is not a text file but a
proprietary format¹¹, which organizes data using
an exclusive encoding scheme, so its interpretation
and manipulation can only be done using a
specific software application.

Comma-separated values is a data format as
old as shell scripting: in 1972 it was already
supported by an IBM product¹². Using CSV or
TSV enables us to manually manipulate the data 
using our favorite spreadsheet application, and 
at the same time use command line tools to 
automate some of the tasks. 


Relational Databases 


If there is a need to use more advanced data
storage techniques, such as using a relational
database¹³, we may still be able to use shell
scripting if we can import and export our data
to a text format. For example, we can open
a relational database, execute Structured Query
Language (SQL) commands¹⁴, and import and
export the data to CSV using the command line
tool sqlite3¹⁵.
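
As a rough sketch of that round trip, and assuming that the sqlite3 command line tool is installed, the steps could look as follows. The file, table and column names are made up for illustration.

```shell
# A CSV file with a header row, so that sqlite3 can name the columns
# (the column names base1 and base2 are hypothetical).
printf 'base1,base2\nA,C\nG,T\n' > pairs.csv

# Remove any previous database so that the import starts from scratch.
rm -f pairs.db

# Import the CSV into a table, run an SQL query, and export the result as CSV.
sqlite3 pairs.db <<'EOF'
.mode csv
.import pairs.csv pairs
.headers on
.output pairs_export.csv
SELECT base1, base2 FROM pairs WHERE base1 = 'G';
EOF

cat pairs_export.csv
```

The exported file is again plain text, so it can be processed further with the other command line tools or opened in a spreadsheet application.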

Besides CSV and shell scripting being al- 
most the same as they were four decades ago, 
they are still available everywhere and are able 
to solve most of our data and text processing 
daily problems. So, these tools are expected to 
continue to be used for many more decades to 
come. As a bonus, we will look like true
professionals typing command line instructions in
a black background window.


What Is in the Book? 


First, Chap. 2 presents a brief overview of
some of the most prominent resources of biomedical
data, text, and semantics. The chapter discusses
what type of information they distribute,
where we can find them, and how we will be 
able to automatically explore them. Most of the 
examples in the book use the resources pro- 
vided by the European Bioinformatics Institute 
(EBI) and use their services to automatically 
retrieve the data and text. Nevertheless, after 
understanding the command line tools, it will
not be hard to adapt them to the formats used
by other service providers, such as the National
Center for Biotechnology Information (NCBI).
In terms of semantics, the examples will use 
two ontologies, one about human diseases and 
the other about chemical entities of biological 
interest. Most ontologies share the same structure
and syntax, so adapting the solutions to other
domains is expected to be painless.

As an example, Chap. 3 will describe the
manual steps that Health and Life specialists may
have to perform to find and retrieve biomedical
text about caffeine using publicly available
resources. Afterwards, these manual steps will
be automated by using command line tools,
including the automatic download of data. The
idea is to go step-by-step and introduce how each
command line tool can be used to automate each
task.


Command Line Tools 


The main command line tools that this book will
introduce are the following:


* curl: a tool to download data and text from
the web;

* grep: a tool to search our data and text;

* gawk: a tool to manipulate our data and text;

* sed: a tool to edit our data and text;

* xargs: a tool to repeat the same step for
multiple data items;

* xmllint: a tool to search in XML data files.


Other command line tools are also presented
to perform minor data and text manipulations,
such as:


* cat: a tool to get the content of a file;

* tr: a tool to replace one character by another;

* sort: a tool to sort multiple lines;

* head: a tool to select only the first lines.
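
To give a first impression of these minor tools, the sketch below applies each of them to a small file. The file name and its content are made up for illustration.

```shell
# A small, made-up list of terms, one per line.
printf 'thymine\nadenine\nguanine\ncytosine\n' > terms.txt

# cat: print the content of the file.
cat terms.txt

# tr: replace one character by another (here, 'e' by 'E').
tr 'e' 'E' < terms.txt

# sort: order the lines alphabetically.
sort terms.txt

# head: keep only the first two lines.
head -n 2 terms.txt
```

Each command reads the file and prints its result on the screen, leaving the original file untouched.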


Pipelines 


A fundamental technique introduced in Chap. 3 
is how to redirect the output of a command line 
tool as input to another tool, or to a file. This en- 
ables the construction of pipelines of sequential 
invocations of command line tools. Using a few 
commands integrated in a pipeline is really the 
maximum shell scripting that this book will use. 
Scripts longer than that would cross the line of 
not having to learn programming skills. 
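
A minimal sketch of such a pipeline is shown below, using made-up file names and data. The output of one tool becomes the input of the next through the `|` operator, and the final result is redirected to a file with `>`.

```shell
# Made-up input: one letter per line, in no particular order.
printf 'G\nA\nT\nC\nA\n' > letters.txt

# Pipeline: sort the lines, then keep only the first three,
# and redirect the final output to a file instead of the screen.
sort letters.txt | head -n 3 > first_three.txt

# Inspect the file produced by the pipeline.
cat first_three.txt
```

Each step can be run and inspected on its own before being connected to the next one, which is what keeps this style of scripting incremental.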

Chapter 4 is about extracting useful information
from the text retrieved previously. The example
consists of finding references to malignant
hyperthermia in these caffeine-related texts, so
we may be able to check any valid relation.


Regular Expressions 


A powerful pattern matching technique described
in this chapter is the usage of regular expressions¹⁶
in the grep command line tool to perform
Named-Entity Recognition (NER)¹⁷. Regular
expressions originated in 1951 (Kleene 1951),
so they are even older than shell scripting, but
they are still popular and available in multiple software
applications and programming languages (Forta
2018). A regular expression is a string that includes
special operators represented by special
characters. For example, the regular expression
A|C|G|T will identify in a given string any of
the four nucleobases adenine (A), cytosine (C),
guanine (G), or thymine (T).
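
As a small sketch of this regular expression in use, the grep tool can print every match found in a string. The input string is made up, and the options used are `-E` (extended regular expressions, needed for the `|` operator) and `-o` (print only the matching parts, one per line).

```shell
# A made-up string mixing nucleobase letters with other characters.
matches=$(echo 'xAyCzGwT-N' | grep -o -E 'A|C|G|T')

# Each match is printed on its own line.
echo "$matches"
```

Characters that are not one of the four alternatives, such as the lowercase letters and the final N, are simply ignored.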

Another technique introduced is tokenization.
It addresses the challenge of identifying the text
boundaries, such as splitting a text into sentences.
So, we can keep only the sentences that may have
something we want. Chapter 4 also describes how
we can try to find two entities in the same sentence,
providing a simple solution to the relation
extraction challenge¹⁸.
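
The two ideas can be sketched together as follows. This is a naive illustration and not necessarily the approach taken in Chap. 4: the text is made up, the file name is hypothetical, and a sentence is assumed to end at every period followed by a space, which fails for abbreviations and decimal numbers.

```shell
# Made-up text with three sentences (illustrative only).
text='Caffeine is a stimulant. Caffeine was tested in malignant hyperthermia patients. Tea contains theanine.'

# Naive tokenization: put each sentence on its own line by inserting
# a newline after every period that is followed by a space.
printf '%s\n' "$text" | sed -e 's/\. /.\
/g' > sentences.txt

# Keep only the sentences that mention both entities (case-insensitive).
grep -i 'caffeine' sentences.txt | grep -i 'malignant hyperthermia'
```

Only the second sentence mentions both entities, so it is the only candidate for a relation between them.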


Semantics 


Instead of trying to recognize a limited list of
entities, Chap. 5 explains how we can use ontologies
to construct large lexicons that include all the
entities of a given domain, e.g. human diseases.
The chapter also explains how the semantics
encoded in an ontology can be used to expand a
search by adding the ancestors and related classes
of a given entity. Finally, a simple solution to
the Entity Linking¹⁹ challenge is given, where
each entity recognized is mapped to a class in 
an ontology. A simple technique to solve the 
ambiguity issue when the same label can be 
mapped to more than one class is also briefly 
presented. 
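
A toy version of lexicon-based recognition and linking is sketched below. Everything in it is an assumption made for illustration: the lexicon file, its two labels, and the class identifiers (EX:0001, EX:0002) are invented and do not come from a real ontology, and the cut tool is used only as a convenient way to select columns.

```shell
# Hypothetical lexicon: one label per line followed by a made-up class identifier.
printf 'malignant hyperthermia\tEX:0001\nasthma\tEX:0002\n' > lexicon.tsv

# Extract the labels (first column) to use them as search patterns.
cut -f 1 lexicon.tsv > labels.txt

# Recognize any lexicon label in a made-up sentence
# (-F: fixed strings, -f: read patterns from a file, -o: print only the matches).
echo 'Caffeine was tested in malignant hyperthermia patients.' |
  grep -o -F -f labels.txt > found.txt

# Link each recognized label to its class by looking it up in the lexicon.
grep -F -f found.txt lexicon.tsv
```

With a real ontology the lexicon would contain thousands of labels and synonyms, but the commands would stay the same.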


¹⁸ https://en.wikipedia.org/wiki/Relationship_extraction
¹⁹ https://en.wikipedia.org/wiki/Entity_linking
Abstract

The previous chapter presented the impor- 
tance of text and semantic resources for Health 
and Life studies. This chapter will describe 
what kind of text and semantic resources are 
available, where they can be found, and how 
they can be accessed and retrieved. 
Biomedical Text


Text is still the preferential means of publishing 
novel knowledge in Health and Life Sciences, 
and where we can expect to find all the 
information about the supporting data. Text 
can be found and explored in multiple types 
of sources, the main being scientific articles and 
patents (Krallinger et al. 2017). However, less 
formal texts are also relevant to explore, such as 
the ones present nowadays in electronic health 
records (Blumenthal and Tavenner 2010). 
What?


In the biomedical domain, we can find text in 
different forms, such as: 


Statement: a short piece of text, normally containing
personal remarks or evidence about
a biomedical phenomenon;

Abstract: a short summary of a larger scientific
document;

Full-text: the entire text present in a scientific
document, including scattered text such as figure
labels and footnotes.


Statements contain more syntactic and semantic 
errors than abstracts, since they normally are 
not peer-reviewed, but they are normally directly 
linked to data providing useful details about it. 
The main advantage of using statements or abstracts
is the brief and succinct form in which
the information is expressed. In the case of abstracts,
there was already an intellectual exercise
to present only the main facts and ideas. Nevertheless,
a brief description may be insufficient to
draw a solid conclusion, which may require some
important details that cannot be summarized in a
short piece of text (Schuemie et al. 2004). These
details are normally presented in the form of a 
full-text document, which contains a complete 
description of the results obtained. For example, 
important details are sometimes only present in 
figure labels (Yeh et al. 2003). 
One major problem of full-text documents is
their availability, since their content may have
restricted access. In addition, the structure of the
full-text and the format in which it is available
vary according to the journal where it was
published. Having more information does not
mean that all of it is beneficial to find what
we need. Some of the information may even
mislead us. For example, the relevance
of a fact reported in the Results Section may be 
different if the fact was reported in the Related 
Work Section. Thus, the usage of full-text may 
create several problems regarding the quality of 
information extracted (Shah et al. 2003). 


Where? 


Access to biomedical literature is normally done
using the internet through PubMed¹, an information
retrieval system released in 1996 that allows
researchers to search and find biomedical texts of
relevance to their studies (Canese 2006). PubMed
is developed and maintained by the National
Center for Biotechnology Information (NCBI), at
the U.S. National Library of Medicine (NLM),
located at the National Institutes of Health (NIH).
Currently, PubMed provides access to more than
28 million citations from MEDLINE, a bibliographic
database with references to a comprehensive
list of academic journals in Health and
Life Sciences². The references include multiple
metadata about the documents, such as title, abstract,
authors, journal, and publication date. PubMed
does not store the full-text documents, but it
provides links where we may find the full-text.
More recently, biomedical references are also
accessible using the European Bioinformatics
Institute (EBI) services, such as Europe PMC³ and
the Universal Protein Resource (UniProt) with its
UniProt citations service⁴.

Other generic alternative tools have also been
gaining popularity for finding scientific texts,
such as Google Scholar⁵, Google Patents⁶,
ResearchGate⁷ and Mendeley⁸.


¹ https://www.nlm.nih.gov/bsd/pubmed.html
² https://www.nlm.nih.gov/bsd/medline.html

Beyond plain text, some tools also integrate
semantic links. One of the first search engines
for biomedical literature to incorporate semantics
was GOPubMed⁹, which categorized texts according
to the Gene Ontology terms found in them (Doms
and Schroeder 2005). These semantic resources
will be described in a following section. A more
recent tool is PubTator¹⁰, which provides the text
annotated with biological entities generated by
state-of-the-art text-mining approaches (Wei
et al. 2013).

There is also a movement in the scientific 
community to produce Open Access Publica- 
tions, making full-texts freely available with 
unrestricted use. One of the main digital
archives of free biomedical full-texts is PubMed
Central¹¹ (PMC), currently providing access to
more than 5 million documents.

Another relevant source of biomedical texts is
the electronic health records stored in health institutions,
but the texts they contain are normally
directly linked to patients and therefore their
access is restricted due to ethical and privacy issues.
As an example, the THYME corpus¹² includes
more than one thousand de-identified clinical
notes from the Mayo Clinic, but it is only available
for text processing research under a data use
agreement (DUA) with Mayo Clinic (Styler IV
et al. 2014).

From generic texts we can also sometimes find 
relevant biomedical information. For example, 
some recent biomedical studies have been pro- 
cessing the texts in social networks to identify 
new trends and insights about a disease, such as 
processing tweets to predict flu outbreaks (Ara- 
maki et al. 2011). 


⁵ http://scholar.google.com/
⁶ http://www.google.com/patents
How? 


To automatically process text, we need programmatic
access to it; this means that from the previous
biomedical data repositories we can only
use the ones that allow this kind of access. These
limitations are imposed because many biomedical
documents have copyright restrictions held
by their publishers. Some restrictions may
define that only manual access is granted, and no
programmatic access is allowed. These restrictions
are normally detailed in the terms of service
of each repository. However, if we face a CAPTCHA
challenge to determine whether we are human
when browsing the repository, it
probably means that some access restrictions are
in place.

Fortunately, NCBI¹³ and EBI¹⁴ online services,
such as PubMed, Europe PMC, or UniProt
Citations, allow programmatic access (Li et al.
2015). Both institutions provide Web APIs¹⁵ that
fully document how web services can be programmatically
invoked. Some resources can even
be accessed using RESTful web services¹⁶,
which are characterized by a simple uniform
interface that makes any Uniform Resource Locator
(URL) almost self-explanatory (Richardson
and Ruby 2008). The same URL shown by our
web browser is the only thing we need to know
to retrieve the data using a command line tool.

For example, if we search for caffeine using
the UniProt Citations service¹⁷, select the first
two entries, and click on download, the browser
will show information about those two documents
using a tabular format.


PubMed ID	Title	Authors/Groups	Abstract/Summary
27702941	Genome-wide association
22333316	Modeling caffeine concentrations


¹³ https://www.ncbi.nlm.nih.gov/home/develop/api/
More important is to check the URL that is
now being used: 


https://www.uniprot.org/citations/?sort=score&desc=&compress=no&query=id:27702941%20OR%20id:22333316&format=tab&columns=id


We can check that the URL has four
main components: the scheme (https), the
hostname (www.uniprot.org), the service
(citations) and the data parameters. The
scheme represents the type of web connection to
get the data, and usually is one of these protocols:
Hypertext Transfer Protocol (HTTP) or HTTP
Secure (HTTPS)¹⁸. The hostname represents the
physical site where the service is available. The
list of parameters depends on the data available
from the different services. We can change
any value of the parameters (arguments) to get
different results. For example, we can replace
the two PubMed identifiers by the following one,
29029291¹⁹, and our browser will now display
the information about this new document:


PubMed ID   Title                      Authors/Groups   Abstract/Summary
29029291    Nutrition Influences...
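The URL components described above can also be separated with standard shell tools. The following lines are only an illustrative sketch: they split the example URL using shell parameter expansion and the tr command, and the variable names are our own choice.

```shell
# Example URL from the text (single PubMed identifier)
url='https://www.uniprot.org/citations/?sort=score&desc=&compress=no&query=id:29029291&format=tab&columns=id'

scheme=${url%%://*}        # text before "://"
rest=${url#*://}           # text after "://"
hostname=${rest%%/*}       # text before the first "/"
path=${rest#*/}            # text after the first "/"
service=${path%%/*}        # first segment of the path
parameters=${url#*\?}      # text after "?"

echo "scheme: $scheme"
echo "hostname: $hostname"
echo "service: $service"
# Print one parameter (name=value) per line
echo "$parameters" | tr '&' '\n'
```

Printing one parameter per line makes it easier to spot which value (here, the query) we need to change to obtain different results.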


The good news is that we can use this link with a command line tool
and automate the retrieval of the data, including extracting the
abstract to process its text.
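As a minimal sketch of this idea, the function below rebuilds the link for any list of PubMed identifiers. The function name citations_url is hypothetical, and the URL pattern is simply the one shown above, which the service may no longer accept in this exact form. The retrieval itself is left as a comment because it requires network access.

```shell
# Build the UniProt Citations URL for one or more PubMed identifiers,
# following the URL pattern shown in the text.
citations_url() {
  query=''
  for id in "$@"; do
    # Join the identifiers with " OR ", writing each space as %20
    query="${query:+$query%20OR%20}id:$id"
  done
  echo "https://www.uniprot.org/citations/?sort=score&desc=&compress=no&query=$query&format=tab&columns=id"
}

citations_url 27702941 22333316

# To actually retrieve the data (requires network access) we could run:
#   curl -s "$(citations_url 29029291)"
```

Keeping the URL construction in one function means that only the identifiers change from one request to the next.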


18 https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
19 https://www.uniprot.org/citations/?sort=score&desc=&compress=no&query=id:29029291&format=


Semantics


Lack of use of standard nomenclatures across biological text makes
text processing a non-trivial task. Often, we can find different
labels (synonyms, acronyms) for the same biomedical entities, or, even
more problematic, different entities sharing the same label (homonyms)
(Rebholz-Schuhmann et al. 2005). Sense disambiguation to select the
correct meaning of an expression in a given piece of text is therefore
a crucial issue. For example, if we find the disease acronym ATS in a
text, we may have to figure out whether it represents the
Andersen-Tawil syndrome or the X-linked Alport syndrome. Further in
the book, we will address this issue by using ontologies and semantic
similarity between their classes (Couto and Lamurias 2019).
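The ATS ambiguity can be made concrete with a toy lexicon. The sketch below is only illustrative: the two ATS rows follow the example in the text, while the tab-separated layout, the MH row and the function name senses are our own assumptions.

```shell
# Toy lexicon: one "label<TAB>class" pair per line.
lexicon=$(printf '%s\t%s\n' \
  'ATS' 'Andersen-Tawil syndrome' \
  'ATS' 'X-linked Alport syndrome' \
  'MH' 'malignant hyperthermia')

# Print every class that uses the given label
senses() {
  printf '%s\n' "$lexicon" | awk -F '\t' -v label="$1" '$1 == label { print $2 }'
}

senses ATS
# A label is ambiguous when it maps to more than one class
echo "ATS has $(senses ATS | wc -l | tr -d ' ') possible meanings"
```

A lookup like this only detects that a label is ambiguous; choosing the right meaning still requires the context in which the label appears.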


What? 


In 1993, Gruber (1993) proposed a short but comprehensive definition
of ontology as:


an explicit specification of a conceptualization 


In 1997 and 1998, Borst (1997) and Studer et al. (1998) refined this
definition to:


a formal, explicit specification of a shared concep- 
tualization 


A conceptualization is an abstract view of 
the concepts and the relationships of a given 
domain. A shared conceptualization means that a 
group of individuals agree on that view, normally 
established by a common agreement among the 
members of a community. The specification is a 
representation of that conceptualization using a 
given language. The language needs to be formal 
and explicit, so computers can deal with it. 


Languages 

The Web Ontology Language (OWL) is nowadays becoming one of the most
common languages to specify biomedical ontologies (McGuinness et al.
2004). Another popular alternative is the Open Biomedical Ontology
(OBO) format developed by the OBO Foundry. The OBO Foundry established
a set of principles to ensure high quality, formal rigor and
interoperability among OBO ontologies (Smith et al. 2007). One
important principle is that OBO ontologies need to be open and
available without any constraint other than acknowledging their
origin.

20 http://purl.obolibrary.org/obo/DOID_0050434
21 http://purl.obolibrary.org/obo/DOID_0110034
22 https://en.wikipedia.org/wiki/Web_Ontology_Language
23 https://en.wikipedia.org/wiki/Open_Biomedical_Ontologies

Concepts are defined as OWL classes that may include multiple
properties. For text processing, important properties include the
labels that may be used to mention that class. The labels may include
the official name, acronyms, exact synonyms, and even related terms.
For example, a class defining the disease malignant hyperthermia may
include as a synonym anesthesia related hyperthermia. Two distinct
classes may share the same label, such as Andersen-Tawil syndrome and
X-linked Alport syndrome, which both have ATS as an exact synonym.
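To give an idea of how such labels can be collected for text processing, the sketch below extracts them from a small class fragment. The fragment is an assumption made for illustration: its tag names follow a convention common in OBO ontologies, and it is not copied from any specific ontology file.

```shell
# Illustrative fragment of an OWL class in RDF/XML. The tag names are an
# assumption based on conventions common in OBO ontologies.
owl_class='<owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_8545">
  <rdfs:label>malignant hyperthermia</rdfs:label>
  <oboInOwl:hasExactSynonym>anesthesia related hyperthermia</oboInOwl:hasExactSynonym>
</owl:Class>'

# Keep the lines holding a label or a synonym, then strip the tags
labels() {
  printf '%s\n' "$owl_class" |
    grep -e '<rdfs:label' -e 'Synonym' |
    sed -e 's/<[^>]*>//g' -e 's/^ *//'
}

labels
```

The result is a plain list of labels, one per line, which is the kind of lexicon a text processing tool needs.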


Formality 
The representation of classes and their relationships may use
different levels of formality, such as controlled vocabularies,
taxonomies and thesauri, which may even include logical axioms.
Controlled vocabularies are lists of terms without any relation
specified between them. Taxonomies are controlled vocabularies that
include subsumption relations, for example specifying that malignant
hyperthermia is a muscle tissue disease. These is-a or subclass
relations are normally the backbone of ontologies. We should note that
some ontologies may include multiple inheritance, i.e. the same
concept may be a specialization of two different concepts. Therefore,
many ontologies are organized as directed acyclic graphs (DAGs), such
as the one represented in Fig. 2.1, and not as hierarchical trees. A
thesaurus includes other types of relations besides subsumption, for
example specifying that caffeine has role mutagen.


Gold Related Documents 
The importance of these relations can be easily understood by
considering the domain modeled by the ontology in Fig. 2.1, and the
need to find texts related to gold. Assume a corpus with one distinct
document mentioning each metal, except for gold, which no document
mentions. So, which documents should we read first?

Fig. 2.1 A DAG representing a classification of metals with multiple
inheritance, since gold and silver are considered both precious and
coinage metals (all the links represent is-a relations). Nodes: metal;
precious, coinage; platinum, palladium, gold, silver, copper

The document mentioning silver is probably the most related, since it
shares two parents with gold: precious and coinage. However, choosing
between the documents mentioning platinum or palladium and the
document mentioning copper depends on our information need. This
information can be obtained from our previous searches or reads. For
example, assuming that our last searches included the word coinage,
then the document mentioning copper is probably the second-most
related. The importance of these semantic resources is evidenced by
the development of the knowledge graph by Google to enhance their
search engine (Singhal 2012).
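The reasoning about shared parents can be sketched with a few shell commands. This is only a naive illustration and not the semantic similarity measures addressed later in the book. The edge list is our reading of Fig. 2.1 (in particular, we assume that platinum and palladium are precious metals), and the function names are hypothetical.

```shell
# is-a links of Fig. 2.1, one "child parent" pair per line
# (platinum and palladium under precious is our reading of the figure)
edges='precious metal
coinage metal
platinum precious
palladium precious
gold precious
gold coinage
silver precious
silver coinage
copper coinage'

# Print the direct parents of a class, one per line
parents() {
  echo "$edges" | grep "^$1 " | cut -d' ' -f2 | sort
}

# Count the direct parents that a class shares with gold
shared_with_gold() {
  parents "$1" | while read -r p; do
    parents gold | grep -x "$p"
  done | wc -l | tr -d ' '
}

# Rank the other metals by the number of parents shared with gold
for m in silver copper platinum palladium; do
  echo "$m $(shared_with_gold "$m")"
done | sort -k2,2nr
```

Silver comes first because it shares both parents with gold, while the remaining metals tie with one shared parent each, which is why our information need is required to break the tie.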


Where? 


Most of the biomedical ontologies are available through BioPortal. In
December of 2018, BioPortal provided access to more than 750
ontologies representing more than 9 million classes. BioPortal allows
us to search for an ontology or a specific class. For example, if we
search for caffeine, we will be able to see the large list of
ontologies that define it. Each of these classes represents a
conceptualization of caffeine in a different domain or from an
alternative perspective. To improve interoperability, some ontologies
include class properties with a link to similar classes in other
ontologies. One of the main goals of the OBO initiative was precisely
to tackle this somewhat disorderly spread of definitions for the same
concepts. Each OBO ontology covers a clearly specified and identified
scope.

24 https://en.wikipedia.org/wiki/Knowledge_Graph
25 http://bioportal.bioontology.org/


OBO Ontologies 

A major example of success of OBO ontologies is the Gene Ontology
(GO), which has been widely and consistently used to describe the
molecular function, biological process and cellular component of gene
products in a uniform way across different species (Ashburner et al.
2000). Another OBO ontology is the Disease Ontology (DO), which
provides human disease terms, phenotype characteristics and related
medical vocabulary disease concepts (Schriml et al. 2018). A third
example is the Chemical Entities of Biological Interest (ChEBI), which
provides a classification of molecular entities of biological interest
with a focus on small chemical compounds (Degtyarenko et al. 2007).


Popular Controlled Vocabularies 

Besides OBO ontologies, other popular controlled vocabularies also
exist. One of them is the International Classification of Diseases
(ICD), maintained by the World Health Organization (WHO). This
vocabulary contains a list of generic clinical terms mainly arranged
and classified according to anatomy or etiology. Another example is
the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED
CT), currently maintained and distributed by the International Health
Terminology Standards Development Organization (IHTSDO). SNOMED CT is
a highly comprehensive and detailed set of clinical terms used in many
biomedical systems. The Medical Subject Headings (MeSH) is a
comprehensive controlled vocabulary maintained by the National Library
of Medicine (NLM) for classifying biomedical and health-related
information and documents. Both MeSH and SNOMED CT are included in the
Metathesaurus of the Unified Medical Language System (UMLS),
maintained by the U.S. National Library of Medicine. This is a large
resource that integrates most of the available biomedical
vocabularies. The 2015AB release covered more than three million
concepts.

26 https://www.who.int/classifications/icd/en/

Another alternative to BioPortal is Ontobee, a repository of
ontologies used by most OBO ontologies that also includes many non-OBO
ontologies. In December 2018, Ontobee provided access to 187
ontologies (Ong et al. 2016).

Other alternatives outside the biomedical domain include the lists of
vocabularies gathered by the W3C SWEO Linking Open Data community
project and by the W3C Library Linked Data Incubator Group.


How? 
After finding the ontologies that cover our domain of interest in the
previous catalogs, a good idea is to find their home page and download
the files from there. This way, we will be sure that we get the most
recent release in the original format, and we can select the subset of
the ontology that really matters for our work. For example, ChEBI
provides three versions: LITE, CORE and FULL. Since we are interested
in using the ontology just for text processing, we are probably not
interested in the chemical data and structures that are available in
CORE. Thus, LITE is probably the best solution, and it will be the one
we will use in this book. However, we may be missing synonyms that are
only included in the FULL version.

27 https://digital.nhs.uk/services/terminology-and-


OWL 

The OWL language is the prevailing language to represent ontologies,
and for that reason will be the format we will use in this book. OWL
extends RDF Schema (RDFS) with more complex statements using
description logic. RDFS is an extension of RDF with additional
statements, such as class-subclass or property-subproperty
relationships. RDF is a data model that stores information in
statements represented as triples of the form subject, predicate and
object. Originally, W3C recommended RDF data to be encoded using
Extensible Markup Language (XML) syntax, also named RDF/XML. XML is a
self-descriptive mark-up language composed of data elements.

For example, the following XML file specifies that caffeine is a drug
that may treat the condition of sleepiness, but without being an
official treatment:


<treatment category="non-official">
  <drug>caffeine</drug>
  <condition>sleepiness</condition>
</treatment>


The information is organized in a hierarchical structure of data
elements. treatment is the parent element of drug and condition. The
character < means that a new data element is being specified, and the
characters </ mean that the specification of a data element is ending.
The treatment element has a property named category with the value
non-official. The drug and condition elements have as values caffeine
and sleepiness, respectively. This is a very simple XML example, but
large XML files are almost unreadable by humans.
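The structure just described can be explored from the command line. The sketch below uses the sed command with simple patterns, which only works because each element of this small example sits on a single line; real XML files are better handled with a proper XML parser or XPath queries. The function name element_value is hypothetical.

```shell
# The XML example from the text, stored in a shell variable
xml='<treatment category="non-official">
  <drug>caffeine</drug>
  <condition>sleepiness</condition>
</treatment>'

# Value of an element written on a single line, e.g. <drug>caffeine</drug>
element_value() {
  printf '%s\n' "$xml" | sed -n "s/.*<$1>\(.*\)<\/$1>.*/\1/p"
}

# Value of the category property of the treatment element
category=$(printf '%s\n' "$xml" |
  sed -n 's/.*<treatment category="\([^"]*\)".*/\1/p')

echo "drug: $(element_value drug)"
echo "condition: $(element_value condition)"
echo "category: $category"
```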

To address this issue, other encoding languages for RDF are now being
used, such as N3 and Turtle. Nevertheless, most biomedical ontologies
are available in OWL using XML encoding.


URI 

The Uniform Resource Identifier (URI) was de- 
fined as the standard global identifier of classes in 
an ontology. For example, the class caffeine 
in ChEBI is identified by the following URI: 


http://purl.obolibrary.org/obo/CHEBI_27732


If a URI represents a link to a retrievable resource, it is considered
a Uniform Resource Locator, or URL. In other words, a URI is a URL if
we can open it in a web browser and obtain a resource describing that
class.
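Since the URIs of OBO classes share a common prefix, converting between a short identifier and its URI is a simple text substitution. The sketch below assumes that prefix, taken from the ChEBI example above, and uses hypothetical function names. Checking whether the URI is also a URL requires network access, so that step is only shown as a comment.

```shell
# Prefix shared by the URIs of OBO classes, as in the ChEBI example
obo_prefix='http://purl.obolibrary.org/obo/'

# CHEBI:27732 -> http://purl.obolibrary.org/obo/CHEBI_27732
id_to_uri() {
  echo "$obo_prefix$(echo "$1" | tr ':' '_')"
}

# http://purl.obolibrary.org/obo/DOID_8545 -> DOID:8545
uri_to_id() {
  echo "$1" | sed "s|^$obo_prefix||" | tr '_' ':'
}

id_to_uri CHEBI:27732
uri_to_id http://purl.obolibrary.org/obo/DOID_8545

# To check whether a URI is also a URL (requires network access), we
# could ask curl for the response headers while following redirects:
#   curl -s -I -L "$(id_to_uri CHEBI:27732)"
```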


34 https://en.wikipedia.org/wiki/Notation3

Sometimes, ontologies are also available as database dumps. These
dumps are normally SQL files that need to be fed to a DataBase
Management System (DBMS). If for any reason we must deal with these
files, we can use the simple command line tool named sqlite3. The tool
has the option to execute the SQL commands to import the data into a
database (.read command), and to export the data into a CSV file
(.mode command) (Allen and Owens 2011).
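The two sqlite3 commands mentioned above can be tried on a tiny dump. In this sketch the dump, its table and its column names are invented for illustration; a real ontology dump will have its own schema. The example also uses the .output command to send the result to a file, and it is skipped when sqlite3 is not installed.

```shell
# A tiny SQL dump standing in for an ontology database dump; the table
# and column names are invented for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/dump.sql" <<'EOF'
CREATE TABLE class (id TEXT, label TEXT);
INSERT INTO class VALUES ('CHEBI:27732', 'caffeine');
INSERT INTO class VALUES ('DOID:8545', 'malignant hyperthermia');
EOF

if command -v sqlite3 >/dev/null 2>&1; then
  # .read imports the dump, .mode csv selects the export format,
  # and .output redirects the query result to a file
  sqlite3 "$workdir/ontology.db" <<EOF
.read $workdir/dump.sql
.mode csv
.output $workdir/class.csv
SELECT id, label FROM class;
EOF
  cat "$workdir/class.csv"
else
  echo "sqlite3 is not installed" >&2
fi
```

Once the data is in a CSV file, it can be processed with the same text commands used for any other tabular file.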


Further Reading 


One important read if we need to know more about biomedical resources
is Arthur Lesk's book about bioinformatics (Lesk 2014). The book has
entire chapters dedicated to where data and text can be found,
providing a comprehensive overview of the type of biomedical
information available nowadays.

A more pragmatic approach is to explore the vast number of manuals,
tutorials, seminars and courses provided by the EBI and NCBI.


36 https://en.wikipedia.org/wiki/Database#Database_management_system
37 https://www.ebi.ac.uk/training
38 https://www.ncbi.nlm.nih.gov/home/learn/


3 Data Retrieval


Abstract

This chapter starts by introducing an example of how we can retrieve
text, where every step is done manually. The chapter will then
describe step-by-step how we can automate each step of the example
using shell script commands, which will be introduced and explained as
they are required. The goal is to equip the reader with a basic set of
skills to retrieve data from any online database and follow the links
to retrieve more information from other sources, such as literature.


Keywords 

Unix shell - Terminal application - Web 
retrieval - CURL: Client Uniform Resource 
Locator - Data extraction - Data selection - 
Data filtering - Pattern matching - XML: 
extensible markup language - XPath: XML 
path language 


Caffeine Example 


As our main example, let us consider that we need to retrieve more
data and literature about caffeine. If we really do not know anything
about caffeine, we may start by opening our favorite internet browser
and then searching for caffeine in Wikipedia to know what it really is
(see Fig. 3.1). From all the information that is available, we can
check in the infobox that there are multiple links to external
sources. The infobox is normally a table added to the top right-hand
part of a web page with structured data about the entity described on
that page.

1 https://en.wikipedia.org/wiki/Caffeine

© The Author(s) 2019

From the list of identifiers (see Fig. 3.2), let us select the link to
one resource hosted by the European Bioinformatics Institute (EBI),
the link to CHEBI:27732.

CHEBI represents the acronym of the resource Chemical Entities of
Biological Interest (ChEBI), and 27732 is the identifier of the entry
in ChEBI describing caffeine (see Fig. 3.3). ChEBI is a freely
available database of molecular entities with a focus on "small"
chemical compounds. More than a simple database, ChEBI also includes
an ontology that classifies the entities according to their structural
and biological properties.

By analyzing the CHEBI:27732 web page we can check that ChEBI provides
a comprehensive set of information about this chemical compound. But
let us focus on the Automatic Xrefs tab. This tab provides a set of
external links to other resources describing entities somehow related
to caffeine (see Fig. 3.4).

In the Protein Sequences section, we have 77 proteins (in September of
2018) related to caffeine. If we click on show all we will get the
complete list (see Fig. 3.5). These links are to another resource
hosted by the EBI, the UniProt, a database of protein sequences and
annotation data.

The list includes the identifiers of each protein with a direct link
to the respective entry in UniProt, the name of the protein and some
topics about the description of the protein. For example, DISRUPTION
PHENOTYPE means some effects caused by the disruption of the gene
coding for the protein are known.

Fig. 3.1 Wikipedia page about caffeine

Fig. 3.2 Identifiers section of the Wikipedia page about caffeine

[Fig. 3.3: ChEBI page "CHEBI:27732 - caffeine"]

Fig. 3.4 External references related to caffeine

2 https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:27732
3 http://www.ebi.ac.uk/chebi/
4 http://www.ebi.ac.uk/chebi/displayAutoXrefs.do?chebiId=CHEBI:27732
5 http://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?dbName=UniProt&chebiId=27732
6 https://web.expasy.org/docs/userman.html#CC_line


Fig. 3.5 Proteins related to caffeine


We should note that at the bottom-right of the page there are Export
options that enable us to download the full list of protein references
in a single file. These options include:


CSV:    Comma Separated Values, an open file format that enables us to
        store data as a single table (columns and rows).
Excel:  a proprietary format designed to store and access the data
        using the software Microsoft Excel.
XML:    eXtensible Markup Language, an open file format that enables
        us to store data using a hierarchy of markup tags.

We start by downloading the CSV, Excel and XML files. We can now open
the files and check their contents in a regular text editor installed
in our computer, such as notepad (Windows), TextEdit (Mac) or gedit
(Linux).

7 https://en.wikipedia.org/wiki/Text_editor

The first lines of the chebi_27732_xrefs_UniProt.csv file should look
like this:


A2AGL3,Ryanodine receptor 3,CC - MISCELLANEOUS
A4GE69,7-methylxanthosine synthase 1,CC - FUNCTION
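As a first taste of how such a file can be processed, the sketch below selects the identifiers with the cut and grep commands. It recreates the two lines above in a temporary file, so it does not depend on the downloaded file. Splitting on commas is safe for the first column, but we should keep in mind that some protein names in the list, such as 3,7-dimethylxanthine N-methyltransferase, contain a comma themselves.

```shell
# The two CSV lines shown above, stored in a temporary file
csv_file=$(mktemp)
cat > "$csv_file" <<'EOF'
A2AGL3,Ryanodine receptor 3,CC - MISCELLANEOUS
A4GE69,7-methylxanthosine synthase 1,CC - FUNCTION
EOF

# Select the first comma-separated column: the protein identifiers
cut -d, -f1 "$csv_file"

# Select the identifiers of the lines mentioning FUNCTION
grep 'FUNCTION' "$csv_file" | cut -d, -f1
```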


The first lines of the chebi_27732_xrefs_UniProt.xls file should look
like this:


"Identifiers"    "Name"    "Line Types"
"A2AGL3"    "Ryanodine receptor 3"    "CC - MISCELLANEOUS"
"A4GE69"    "7-methylxanthosine synthase 1"    "CC - FUNCTION"


As we can see, this is not the proprietary format XLS but instead a
TSV (Tab Separated Values) format. Thus, the file can still be opened
directly in Microsoft Excel.
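The same selection works on the tab-separated version. The sketch below rebuilds the three lines above with real tab characters in a temporary file, which is an assumption about how the downloaded file separates its columns, and then removes the header line and the double quotes.

```shell
# The TSV lines shown above, rebuilt with real tab characters
tsv_file=$(mktemp)
printf '"%s"\t"%s"\t"%s"\n' \
  'Identifiers' 'Name' 'Line Types' \
  'A2AGL3' 'Ryanodine receptor 3' 'CC - MISCELLANEOUS' \
  'A4GE69' '7-methylxanthosine synthase 1' 'CC - FUNCTION' > "$tsv_file"

# Skip the header line, select the first tab-separated column,
# and remove the double quotes around each value
tail -n +2 "$tsv_file" | cut -f1 | tr -d '"'
```

The cut command uses the tab character as its default delimiter, so no -d option is needed here.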

The first lines of the chebi_27732_xrefs_UniProt.xml file should look
like this:


<?xml version="1.0"?>
<table>
<row>
<column>A2AGL3</column>
<column>Ryanodine receptor 3</column>
<column>CC - MISCELLANEOUS</column>
</row>
<row>
<column>A4GE69</column>
<column>7-methylxanthosine synthase 1</column>
<column>CC - FUNCTION</column>
</row>


Fig. 3.6 UniProt entry describing the Ryanodine receptor 1

We should note that all the files contain the same data; they only use
a different format.
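We can convince ourselves of this with a small check. The sketch below rebuilds one comma-separated line from each row of the XML excerpt and compares the result with the CSV excerpt. It assumes that each column element sits on its own line, and the function name xml_as_csv is hypothetical.

```shell
# CSV and XML excerpts with the same two proteins
csv='A2AGL3,Ryanodine receptor 3,CC - MISCELLANEOUS
A4GE69,7-methylxanthosine synthase 1,CC - FUNCTION'

xml='<row>
<column>A2AGL3</column>
<column>Ryanodine receptor 3</column>
<column>CC - MISCELLANEOUS</column>
</row>
<row>
<column>A4GE69</column>
<column>7-methylxanthosine synthase 1</column>
<column>CC - FUNCTION</column>
</row>'

# Rebuild one comma-separated line per <row> from its <column> values
xml_as_csv() {
  printf '%s\n' "$xml" | awk '
    /<row>/    { line = ""; first = 1 }
    /<column>/ { gsub(/<\/?column>/, ""); line = line (first ? "" : ",") $0; first = 0 }
    /<\/row>/  { print line }'
}

if [ "$(xml_as_csv)" = "$csv" ]; then
  echo "same data"
else
  echo "different data"
fi
```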

If, for any reason, we are not able to download the previous files
from UniProt, we can get them from the book file archive.

In the following sections we will use these files to automate this
process, but for now let us continue our manual exercise using the
internet browser. Let us select the Ryanodine receptor 1 with the
identifier P21817 and click on the link (see Fig. 3.6). We can now see
that UniProt is much more than just a sequence database. The sequence
is just a tiny fraction of all the information describing the protein.
All this information can also be downloaded as a single file by
clicking on Format and on XML. Then, save the result as an XML file to
our computer.

8 http://labs.rd.ciencias.ulisboa.pt/book/
9 http://www.uniprot.org/uniprot/P21817

Again, we can use our text editor to open the downloaded file named
P21817.xml, whose first lines should look like this:


<?xml version='1.0' encoding='UTF-8'?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="1991-05-01" modified="2018-06-20" version="210">
<accession>P21817</accession>

Fig. 3.7 Publications related to Ryanodine receptor 1

We can check that this entry represents a Homo sapiens (Human)
protein, so if we are interested only in human proteins, we will have
to filter them. For example, the entry E9PZQ0 in the ChEBI list also
represents a Ryanodine receptor 1 protein, but for Mus musculus
(Mouse).
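The species filter mentioned above can be sketched with the grep command. The two files created below are simplified stand-ins for downloaded UniProt entries, and the layout of the organism element is an assumption made for this illustration; the function name entries_of is hypothetical.

```shell
# Simplified fragments of two UniProt XML entries; the <organism> layout
# shown here is an assumption made for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/P21817.xml" <<'EOF'
<entry><accession>P21817</accession>
<organism><name type="scientific">Homo sapiens</name></organism></entry>
EOF
cat > "$workdir/E9PZQ0.xml" <<'EOF'
<entry><accession>E9PZQ0</accession>
<organism><name type="scientific">Mus musculus</name></organism></entry>
EOF

# Print the identifiers of the entries describing a given species
entries_of() {
  grep -l "<name type=\"scientific\">$1</name>" "$workdir"/*.xml |
    sed -e 's|.*/||' -e 's|\.xml$||'
}

entries_of 'Homo sapiens'
```

The -l option makes grep print only the names of the matching files, from which the identifier is recovered by removing the directory and the .xml extension.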

Going back to the browser, in the top-left side of the UniProt entry
we have a link to publications. If we click on it, we will see a list
of publications somehow related to the protein (see Fig. 3.7).

Let us assume that we are interested in finding phenotypic
information. The first title that may attract our attention is:
Polymorphisms and deduced amino acid substitutions in the coding
sequence of the ryanodine receptor (RYR1) gene in individuals with
malignant hyperthermia. To know more about the publication, we can use
the UniProt citations service by clicking on the Abstract link (see
Fig. 3.8).

10 http://www.uniprot.org/uniprot/E9PZQ0
11 https://www.uniprot.org/uniprot/P21817/publications
12 https://www.uniprot.org/citations/1354642
13 http://labs.rd.ciencias.ulisboa.pt/mer/

To check if the abstract mentions any disease we can use an online
text mining tool, for example the Minimal Named-Entity Recognizer
(MER). We can copy and paste the abstract of the publication into MER
and select DO - Human Disease Ontology as lexicon (see Fig. 3.9).

We will see that MER detects three mentions of malignant hyperthermia,
giving us another link about the disease found (see Fig. 3.10).
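The three mentions can be reproduced with a plain text search. The sketch below is not MER itself, which also links each mention to an ontology class; it only counts the occurrences of one term using the grep command. The sentences are shortened from the abstract of the publication, and the function name mentions is hypothetical.

```shell
# Sentences shortened from the abstract of the publication
abstract='Twenty-one polymorphic sequence variants of the RYR1 gene were identified from three individuals predisposed to malignant hyperthermia (MH).
A single family presented the Arg for Gly248 substitution where it segregated with malignant hyperthermia.
The other three polymorphic substitutions failed to segregate with malignant hyperthermia in those families in which they occurred.'

# Print each mention of a term, one per line, ignoring case
mentions() {
  printf '%s\n' "$abstract" | grep -o -i "$1"
}

mentions 'malignant hyperthermia' | wc -l
```

A search like this needs the term to be known in advance, whereas a tool such as MER takes a whole lexicon of labels as input.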

Thus, in summary, we started from a generic definition of caffeine and
ended with an abstract about hyperthermia by following the links in
different databases. Of course, this does not mean that by taking
caffeine we will get hyperthermia, or that we will treat hyperthermia
by taking caffeine (maybe as a cold drink). However, this relation has
a context, a protein and a publication, that needs to be further
analyzed before drawing any conclusions.

We should note that we only analyzed one 
protein and one publication, we now need to 
repeat all the steps to all the proteins and to 
all the publications related to each protein. And 
this could even be more complicated if we were 
interested in other central nervous system stim- 
ulants, for example by looking in the ChEBI 


^ http://purl.obolibrary.org/obo/DOID 8545 
Dhttps://en.wikipedia.org/wiki/HyperthermiastTreatment 


Caffeine Example 23 


w 
Help Contact 


Literature citation 


Map to E) Format 


UniProtKB (1) 


Tà Reviewed (1) Polymorphisms and deduced amino acid substitutions in the coding sequence of the 


Sweess-Prot 


ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia. 


Gillard E.F., Otsu K., Fuji J., Duff C.L., de Leon S., Khanna V.K., Britt B.A., Worton R.G., McLennan D.H. 


Twenty-one polymorphic sequence variants of the RYR1 gene, including 13 restriction fragment length polymorphisms (RFLPs), were identified by 
sequence analysis of human ryanodine receptor (RYR1) cDNAs from three individuals predisposed to malignant hyperthermia (MH). All RFLPs were 
detectable in PCR-amplified products, and their segregation was consistent with our initial finding of linkage to MH in the nine families previously 
informative for one or more intragenic markers (MacLennan et al., 1990, Nature 343:559-561). Four amino acid substitutions were identified in the 
study: Arg for Gly248, Cys for Arg470, Leu for Pro1785, and Cys for Gly2059. Of 45 families tested, a single family presented the Arg for Gly248 
substitution where it segregated with malignant hyperthermia, making it a candidate mutation for predisposition to MH in man. The other three 
polymorphic substitutions failed to segregate with malignant hyperthermia in those families in which they occurred, implying that they represent 
polymorphisms with little or no effect on the function of the RYR1 gene. 


Genomics 13:1247-1254(1992) [PubMed] [Europe PMC] 


Fig. 3.8 Abstract of the publication entitled Polymorphisms and deduced amino acid substitutions in the coding 
sequence of the ryanodine receptor (RYR1) gene in individuals with malignant hyperthermia 


malignant hyperthermia  http://purl.obolibrary.org/obo/DOID_8545


Fig. 3.9 Diseases recognized by the online tool MER in an abstract 


ontology¹⁶. This is of course the motivation to automate the process,
since it is not humanly feasible to deal with such a large amount of
data, which keeps evolving every day.

However, if the goal was to find a relation between caffeine and
hyperthermia, we could simply have searched these two terms in PubMed.


16 https://www.ebi.ac.uk/chebi/chebiOntology.do?chebiId=35337


Class: malignant hyperthermia
Term IRI: http://purl.obolibrary.org/obo/DOID_8545
Definition: A muscle tissue disease that is characterized by a drastic and uncontrolled increase in skeletal muscle oxidative metabolism, which overwhelms
the body's capacity to supply oxygen, remove carbon dioxide, and regulate body temperature.


Fig. 3.10 Ontobee entry for the class malignant hyperthermia 


We did not do that because some relations are not explicitly
mentioned in the text, so we have to navigate through database links.
The second reason is that we needed an example using different
resources and multiple entries to explain how we can automate most of
these steps using shell scripting. The automation of the example will
introduce a comprehensive set of techniques and commands, which with
some adaptation Life and Health specialists can use to address many
of their text and data processing challenges.


Unix Shell 


The first step is to open a shell on our personal computer. A shell
is a software program that interprets and executes command lines
given by the user in consecutive lines of text. A shell script is a
list of such command lines. A command line usually starts by invoking
a command line tool. This manuscript will introduce a few command
line tools, which will allow us to automate the previous example.
Unix shells were developed to manage Unix-like operating systems, but
due to their usefulness they are nowadays available on most personal
computers running Linux, macOS or Windows operating systems. There
are many types of Unix shells with minor differences between them
(e.g. sh, ksh, csh, tcsh and bash), but the most widely available is
the Bourne-Again shell (bash¹⁷). The examples in this manuscript were
tested using bash.


17 https://en.wikipedia.org/wiki/Bash_(Unix_shell)


So, the first step is to open a shell on our personal computer using
a terminal application (see Fig. 3.11). If we are using Linux or
macOS then this is usually not new for us, since we most probably
have a terminal application already installed that opens a shell for
us. In case we are using a Microsoft Windows operating system, then
we have several options to consider. If we are using Windows 10, then
we can install the Windows Subsystem for Linux¹⁸ or just install a
third-party application, such as MobaXterm¹⁹. No matter which
terminal application we end up using, the shell will always have a
common look: a text window with a blinking cursor waiting for our
first command line. We should note that most terminal applications
allow the usage of the up and down cursor keys to select, edit, and
execute previous commands, and the usage of the tab key to complete
the name of a command or a file.


Current Directory 


As our first command line, we can type: 
$ pwd 


After hitting enter, the command will show the full path of the
directory (folder) of our computer in which the shell is working. The
dollar sign on the left only indicates that this is a command to be
executed directly in the shell.


18 https://docs.microsoft.com/en-us/windows/wsl/about
19 https://mobaxterm.mobatek.net/


Fig. 3.11 Screenshot of a Terminal application (Source: https://en.wikipedia.org/wiki/Unix)


To understand a command line tool, such as pwd, we can type man
followed by the name of the tool. For example, we can type man pwd to
learn more about pwd (do not forget to hit enter, and press q to
quit). We can also learn more about man by typing man man. A shorter
alternative to man is to add the --help option after any command line
tool. For example, we can type pwd --help to have a more concise
description of pwd.

As our second command line, we can type ls and hit enter. It will
show the list of files in the current directory. For example, we can
type ls --help to have a concise description of ls.

Since we will work with files that we need to open with a text editor
or a spreadsheet application²⁰, such as LibreOffice Calc or Microsoft
Excel, we should select a current directory that we can easily open
in our file explorer application. A good idea is to open our favorite
file explorer application, select a directory, and then check its
full path²¹.


20 https://en.wikipedia.org/wiki/Spreadsheet
21 https://en.wikipedia.org/wiki/Path_(computing)


Windows Directories 


Notice that in the full path to a directory in Windows each name is
separated by a backslash (\), while in a Unix shell the separator is
a forward slash (/).

For example, a Windows path to the Documents folder may look like:


C:\Users\MyUserName\Documents


If we are using the Windows Subsystem for Linux²², the previous
folder must be accessed using the path:


/mnt/c/Users/MyUserName/Documents


If we are using MobaXterm²³, the following path should be used
instead:


/drives/c/Users/MyUserName/Documents
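
The mapping between the two notations is mechanical, so it can be sketched in the shell itself. The function name win2unix and its second argument (the prefix) are hypothetical names chosen only for this illustration, and the sketch assumes a simple path that starts with a drive letter followed by a colon.

```shell
# Convert a Windows path into the form used inside a Unix shell.
# win2unix and its prefix argument are illustrative, not standard tools.
win2unix() {
  winpath=$1   # e.g. C:\Users\MyUserName\Documents
  prefix=$2    # /mnt for the Windows Subsystem for Linux, /drives for MobaXterm
  # Drive letter: everything before the first colon, in lower case
  drive=$(printf '%s' "${winpath%%:*}" | tr 'A-Z' 'a-z')
  # Remainder: everything after the first colon, with \ turned into /
  rest=$(printf '%s' "${winpath#*:}" | tr '\\' '/')
  printf '%s/%s%s\n' "$prefix" "$drive" "$rest"
}

win2unix 'C:\Users\MyUserName\Documents' /mnt
win2unix 'C:\Users\MyUserName\Documents' /drives
```

The Windows path is given within single quotes so that the shell keeps the backslashes exactly as typed.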


22 https://www.howtogeek.com/261383/how-to-access-your-ubuntu-bash-files-in-windows-and-your-windows-system-drive-in-bash/
23 https://mobaxterm.mobatek.net/documentation.html


Change Directory


To change the directory, we can use another command line tool, cd
(change directory), followed by the new path. In a Linux system we
may want to use the Documents directory. If the Documents directory
is inside our current directory (shown using ls), we only need to
type:


$ cd Documents


Now we can type pwd to see what changed. And if we want to return to
the parent directory, we only need to use two dots (..):


$ cd ..


And if we want to return to the home directory, we only need to use
the tilde character (~):


$ cd ~


Again, we should type pwd to double check that we are in the
directory we really want.

In Windows we may need to use the full path, for example:


$ cd /mnt/c/Users/MyUserName/Documents


We should note that we need to enclose the path within single (or
double) quotes in case it contains spaces:


$ cd '/mnt/c/Users/MyUserName/Documents'


Later on, we will learn more about the difference between using
single or double quotes. For now, we may assume that they are
equivalent. To know more about cd, we can type cd --help.
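
The navigation steps above can be rehearsed without touching our real folders. The following is only a sketch: it assumes the mktemp tool is available to create a throwaway directory, and the directory name My Documents is made up to show why quotes matter.

```shell
# Work in a throwaway directory so nothing in our real folders is touched
start=$(mktemp -d)
mkdir "$start/My Documents"

cd "$start/My Documents"   # the quotes are needed because of the space
pwd                        # full path ending in "My Documents"
inside=$(basename "$PWD")

cd ..                      # go up to the parent directory
pwd
parent=$(basename "$PWD")

echo "we were in '$inside' and are now in '$parent'"
```

Without the quotes, cd would receive two separate arguments instead of one path.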


Useful Key Combinations 


Every time the terminal is blocked for any reason, we can press both
the control and C keys at the same time²⁴. This usually cancels the
tool currently being executed. For example, try using the cd command
with only one single quote:


$ cd '


24 https://en.wikipedia.org/wiki/Control


This will block the terminal, because it is still waiting for a
second single quote that closes the argument. Now press control-C,
and the command will be aborted.

Now we can type the previous command again, but instead of pressing
control-C we may also press control-D²⁵. The combination control-D
indicates to the terminal that it is the end of input. So, in this
case, the cd command will not be canceled; instead it is executed
without the second single quote, and therefore a syntax error will be
shown on our display.

Other useful key combinations are control-L, which clears the
terminal display, and control-insert and shift-insert, which copy and
paste the selected text, respectively.


Shell Version 


The following examples will probably work in 
any Unix shell, but if we want to be certain that 
we are using bash we can type the following 
command, and check if the output says bash. 


$ ps -p $$ 

ps is a command line tool that shows information about the active
processes running on our computer. The -p option selects a given
process, and in this case $$ represents the process running in our
terminal application. In most terminal applications bash is the
default shell. If this is not our case, we may need to type bash and
hit enter, and from then on we are using bash.
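
A complementary check, offered here as a small sketch and not as part of the original example, relies on the BASH_VERSION variable, which bash defines for itself and which other shells normally leave empty.

```shell
# BASH_VERSION is set by bash itself; the :- form avoids an error
# in shells where the variable does not exist
if [ -n "${BASH_VERSION:-}" ]; then
  answer="bash $BASH_VERSION"
else
  answer="not bash"
fi
echo "$answer"
```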

Now that we know how to use a shell, we can start writing and running
a very simple script that reverses the order of the lines in a text
file.


Data File 


We start by creating a file named myfile.txt using any text editor,
and adding the following lines:


line 1
line 2
line 3
line 4


We must not forget to save it in our working directory, and check
that it has the proper filename extension.


25 https://en.wikipedia.org/wiki/End-of-Transmission_character


File Contents 


To check if the file is really in our working directory, we can type:


$ cat myfile.txt 


The contents of the file should appear in our 
terminal. cat is a simple command line tool 
that receives a filename as argument and displays 
its contents on the screen. We can type man 
cat or cat --help to know more about this 
command line tool. 


Reverse File Contents 


An alternative to the cat tool is the tac tool. To try it, we only
need to type:


$ tac myfile.txt


The contents of the file should also appear in our terminal, but now
in the reverse order. We can type man tac or tac --help to know more
about this command line tool.


My First Script 


Now we can create a script file named reversemyfile.sh by using the
text editor, and add the following line:


1 tac $1


We must not forget to save the file in our working directory. $1
represents the first argument after the script filename when invoking
it. Each script file presented in this manuscript will include the
line numbers on the left. This will help us not only to identify how
many lines the script contains, but also to distinguish a script file
from the commands to be executed directly in the shell.


Line Breaks


A Unix file represents a single line break by a line feed character,
instead of the two characters (carriage return and line feed) used by
Windows²⁶. So, if we are using a text editor in Windows, we must be
careful to use one that lets us save the script as a Unix file, for
example the open source Notepad++²⁷.

In case we do not have such a text editor, we can also remove the
extra carriage return by using the command line tool tr, which
replaces and deletes characters:


$ tr -d '\r' < reversemyfile.sh > reversemyfilenew.sh


The -d option of tr is used to remove a given character from the
input; in this case tr will delete all carriage returns (\r). Many
command line options can be used in a short form using a single dash
(-), or in a long form using two dashes (--). In this tool, using the
--delete option is equivalent to the -d option. Long forms are more
self-explanatory, but they take longer to type and occupy more space.
We can type man tr or tr --help to know more about this command line
tool.


Redirection Operator 


The > character represents a redirection operator²⁸ that moves the
results being displayed at the standard output (our terminal) to a
given file. The < character represents a redirection operator that
works in the opposite direction, i.e. opens a given file and uses it
as the standard input.

We should note that cat received the filename as an input argument,
while tr can only receive the contents of the file through the
standard input. Instead of providing the filename as argument, the
cat command can also receive the contents of a file through the
standard input, and produce the same output:


$ cat < myfile.txt


26 https://en.wikipedia.org/wiki/Newline
27 https://notepad-plus-plus.org/
28 https://www.gnu.org/software/bash/manual/html_node/Redirections.html


The previous tr command used a new file for 
the standard output, because we cannot use the 
same file to read and write at the same time. To 
keep the same filename, we have to move the new 
file by using the mv command: 


$ mv reversemyfilenew.sh reversemyfile.sh


We can type man mv or mv --help to know 
more about this command line tool. 
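
The whole sequence (read from one file, write to a new one, then move the new file over the old one) can be tried on a file that we create ourselves. This is a minimal sketch under the assumption that mktemp is available; the printf line only simulates what a Windows text editor would have saved.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Simulate a script saved by a Windows editor: the line ends with \r\n
printf 'tac $1\r\n' > reversemyfile.sh
before=$(wc -c < reversemyfile.sh)

# Read from one file, write to another, then replace the original
tr -d '\r' < reversemyfile.sh > reversemyfilenew.sh
mv reversemyfilenew.sh reversemyfile.sh
after=$(wc -c < reversemyfile.sh)

echo "bytes before: $before, bytes after: $after"
```

The file becomes one byte shorter, which is the carriage return removed by tr.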


Installing Tools 


These last two commands could be replaced by the dos2unix tool:


$ dos2unix reversemyfile.sh


If not available, we have to install the dos2unix tool. For example,
in the Ubuntu Windows Subsystem we need to execute:


$ apt install dos2unix


The apt (Advanced Package Tool) command is used to install packages
in many Linux systems²⁹. Another popular alternative is the yum
(Yellowdog Updater, Modified) command³⁰.

To avoid fixing line breaks each time we update our file when using
Windows, a clearly better solution is to use a Unix-friendly text
editor.

When we are not using Windows, or we are using a Unix-friendly text
editor, the previous commands will execute but nothing will happen to
the contents of reversemyfile.sh, since the tr command will not
remove any character. To see the command working, replace '\r' by '$'
and check what happens.


Permissions 


A script also needs permission to be executed, so every time we
create a new script file we need to type:


$ chmod u+x reversemyfile.sh


29 https://en.wikipedia.org/wiki/APT_(Debian)
30 https://en.wikipedia.org/wiki/Yum_(software)


The command line tool chmod just gave the user 
(u) permissions to execute (+x). We can type 
man chmod or chmod --help to know more 
about this command line tool. 

Finally, we can execute the script by providing myfile.txt as
argument:


$ ./reversemyfile.sh myfile.txt


The contents of the file should appear in our 
terminal in the reverse order: 


line 4 
line 3 
line 2 
line 1 


Congratulations, we made our first script work!
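
For readers who prefer to see all the steps in one place, the following sketch recreates the data file and the script from the command line instead of a text editor. It assumes that mktemp and tac are installed (tac is part of GNU coreutils and may be missing on some systems), and it writes the script with "$1" in double quotes, a small deviation from the listing above that keeps filenames with spaces in one piece.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# The data file with four lines
printf 'line 1\nline 2\nline 3\nline 4\n' > myfile.txt

# The one-line script; the quoted 'EOF' keeps $1 from being replaced here
cat > reversemyfile.sh << 'EOF'
tac "$1"
EOF

# Give the user permission to execute the script, then run it
chmod u+x reversemyfile.sh
reversed=$(./reversemyfile.sh myfile.txt)
echo "$reversed"
```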

If we give more arguments, they will be ignored:


$ ./reversemyfile.sh myfile.txt myotherfile.txt 'my other file.txt'


The output will be exactly the same because our script does not use
$2 and $3, which in this case represent myotherfile.txt and
my other file.txt, respectively. We should note that when an argument
contains spaces, it must be enclosed by single quotes.
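
To see how the arguments are numbered, we can print them. The sketch below uses a shell function named show_args, a hypothetical helper; inside a function, $1, $2, $3 and $# (the number of arguments) behave as they do inside a script file.

```shell
# show_args is an illustrative name; it only prints what it receives
show_args() {
  echo "number of arguments: $#"
  echo "first: $1"
  echo "second: $2"
  echo "third: $3"
}

show_args myfile.txt myotherfile.txt 'my other file.txt'
```

Thanks to the single quotes, my other file.txt arrives as one argument ($3) instead of three.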


Debug 
If something is not working well, we can debug 
the entire script by typing: 


$ bash -x reversemyfile.sh myfile.txt


Our terminal will not only display the result- 
ing text, but also the command line tools executed 
preceded by the plus character (+): 


+ tac myfile.txt 


line 4 
line 3 
line 2 
line 1 


Alternatively, we can add the set -x command line in our script to
start the debugging mode, and set +x to stop it.
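
As a sketch of this second approach, the script below traces only one of its commands. The file name debugdemo.sh is made up for the illustration, cat is used in place of tac only to keep the sketch portable, and bash is assumed to be installed.

```shell
workdir=$(mktemp -d)
printf 'line 1\nline 2\n' > "$workdir/myfile.txt"

# Only the commands between set -x and set +x are traced; the trace is
# written to the standard error output, each command preceded by +
cat > "$workdir/debugdemo.sh" << 'EOF'
echo "not traced"
set -x
cat "$1"
set +x
echo "not traced either"
EOF

bash "$workdir/debugdemo.sh" "$workdir/myfile.txt"
```

The set +x line itself also shows up in the trace, since debugging is still active when it is executed.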


Save Output 


We can now save the output into another file 
named mynewfile.txt by typing: 


$ ./reversemyfile.sh myfile.txt > mynewfile.txt


Again, to check if the file was really created, we can use the cat
tool:


$ cat mynewfile.txt


Or, we can reverse it again by typing:


$ ./reversemyfile.sh mynewfile.txt


Of course, the result should be exactly the original contents of
myfile.txt.


Web Identifiers 


The input argument(s) of our retrieval task is the chemical
compound(s) about which we want to retrieve more information. For the
sake of simplicity, we will start by assuming that the user knows the
ChEBI identifier(s), i.e. the script does not have to search by the
name of the compounds. Nevertheless, finding the identifier of a
compound by its name is also possible, and this manuscript will
describe how to do it later on.

So, the first step is to automatically retrieve all proteins
associated with the given input chemical compound, which in our
example was caffeine (CHEBI:27732). In the manual process, we
downloaded the files by manually clicking on the links shown as
Export options, namely the URLs:


https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt

https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=2&6578706f7274=1&chebiId=27732&dbName=UniProt

https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=3&6578706f7274=1&chebiId=27732&dbName=UniProt


for downloading a CSV, Excel, or XML file, respectively.

We should note that the only difference between the three URLs is a
single numerical digit (1, 2, and 3) after the first equals character
(=), which means that this digit can be used as an argument to select
the type of file. Another parameter that is easily observable is the
ChEBI identifier (27732). Try to replace 27732 by 17245 in any of
those URLs by using a text editor, for example:


https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=17245&dbName=UniProt


Now we can use this new URL in the internet browser, and check what
happens. If we did it correctly, our browser downloaded a file with
more than seven hundred proteins, since 17245 is the ChEBI identifier
of a popular chemical compound in life systems, carbon monoxide.

In this case, we are not using a fully RESTful web service, but the
data path is pretty modular and self-explanatory. The path is clearly
composed of:


* the name of the database (chebi);

* the method (viewDbAutoXrefs.do);

* and a list of parameters and their values (arguments) after the
question mark character (?).


The order of the parameters in the URL is normally not relevant. They
are separated by the ampersand character (&) and the equals character
(=) is used to assign a value to each parameter (argument). The
modular structure of these URLs allows us to use them as data
pipelines to fill our local files with data, like pipelines that
transport oil or gas from one container to another.


Single and Double Quotes


To construct the URL for a given ChEBI identifier, let us first
understand the difference between single quotes and double quotes in
a string (sequence of characters). We can create a script file named
getproteins.sh by using a text editor to add the following lines:


1 echo 'The input: $1'
2 echo "The input: $1"


The command line tool echo displays the string received as argument.
Do not forget to save it in our working directory and add the right
permissions with chmod as we did previously with our first script.

Now to execute the script we will only need to type:


$ ./getproteins.sh


The output on the terminal should be:


The input: $1
The input:


This means that when using single quotes, the string is interpreted
literally as it is, whereas the string within double quotes is
analyzed, and if there is a special character, such as the dollar
sign ($), the shell translates it to what it represents. In this
case, $1 represents the first input argument. Since no argument was
given, the string within double quotes displays nothing in its place.

To execute the script with an argument, we can type:


$ ./getproteins.sh 27732


The output on our terminal should be:


The input: $1
The input: 27732


We can now check that when using double quotes $1 is translated to
the string given as argument.
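
The same comparison can be made directly in the shell, without a script file. In this sketch the set -- command is used to assign the positional parameters, as if a script had been called with 27732 as its first argument; the variable names single and double are ours.

```shell
# set -- assigns the positional parameters of the current shell
set -- 27732

single='The input: $1'    # kept literally
double="The input: $1"    # $1 is replaced by its value

echo "$single"
echo "$double"
```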

Now we can update our script file named getproteins.sh to contain
only the following line:


1 echo "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"


Comments


Instead of removing the previous lines, we can transform them into
comments by adding the hash character (#) to the beginning of each
line:


1 #echo 'The input: $1'
2 #echo "The input: $1"
3 echo "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"


Commented lines are ignored by the computer when executing the
script.

Now, we can execute the script giving the ChEBI identifier as
argument:


$ ./getproteins.sh 27732


The output on our terminal should be the link that returns the CSV
file containing the proteins associated with caffeine.
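
Since both the file type digit and the identifier are just parts of a string, the URL can be assembled from two arguments. The following is a sketch only: chebi_xrefs_url is a hypothetical function name, and the second argument (defaulting to 1 when omitted) is our own addition to the one-argument script above.

```shell
# chebi_xrefs_url is an illustrative helper; the URL layout is the one
# discussed in the text (format digit: 1 = CSV, 2 = Excel, 3 = XML)
chebi_xrefs_url() {
  id=$1
  format=${2:-1}
  echo "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=${format}&6578706f7274=1&chebiId=${id}&dbName=UniProt"
}

url=$(chebi_xrefs_url 27732)
echo "$url"

# List the parameters: keep what follows the ?, then split on &
echo "$url" | cut -d '?' -f 2 | tr '&' '\n'
```

The last command prints one parameter=value pair per line, which makes the modular structure of the URL easy to inspect.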


Data Retrieval 


After having the link, we need a web retrieval tool that works like
our internet browser, i.e. receives as input a URL for programmatic
access and retrieves its contents from the internet. We will use
Client Uniform Resource Locator (cURL), which is available as a
command line tool, and allows us to download the result of opening a
URL directly into a file (man curl or curl --help for more
information).

For example, to display on our screen the list of proteins related to
caffeine, we just need to add the respective URL as input argument:


$ curl 'https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt'


In some systems the curl command needs to be installed³¹. Since we
are using a secure connection (https), we may also need to install
the ca-certificates package³².


31 apt install curl

An alternative to curl is the command wget, which also receives a URL
as argument, but by default wget writes the contents to a file
instead of displaying them on the screen (man wget or wget --help for
more information). So, to obtain the equivalent behavior, we add the
-O- option to select where the contents are placed:


$ wget -O- 'https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt'


We should note that the dash character (-) after -O represents the
standard output. The equivalent long form of the -O option is
--output-document=file.

The output on our terminal should be the long list of proteins:


Q15413,Ryanodine receptor 3,CC - MISCELLANEOUS
Q92375,Thioredoxin reductase,DE
Q92736,Ryanodine receptor 2,CC - MISCELLANEOUS
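
Because either tool may be missing on a given system, a script can check which one is installed before using it. The sketch below is only one possible arrangement: fetch is a hypothetical function name, command -v is the standard way to ask the shell whether a command exists, and the -s (curl) and -q (wget) options merely silence progress messages.

```shell
# fetch is an illustrative wrapper: it prints the contents of a URL on
# the standard output using whichever of the two tools is installed
fetch() {
  if command -v curl > /dev/null 2>&1; then
    curl -s "$1"
  elif command -v wget > /dev/null 2>&1; then
    wget -q -O- "$1"
  else
    echo "neither curl nor wget is installed" >&2
    return 1
  fi
}

# Not executed here because it needs network access:
# fetch 'https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=27732&dbName=UniProt'
```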


Instead of using a fixed URL, we can update the script named
getproteins.sh to contain only the following line:


1 curl "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"


We should note that now we are using double quotes, since we replaced
the caffeine identifier by $1.

Now to execute the script we only need to provide a ChEBI identifier
as input argument:


$ ./getproteins.sh 27732


The output on our terminal should be the long list of proteins:


Q15413,Ryanodine receptor 3,CC - MISCELLANEOUS
Q92375,Thioredoxin reductase,DE
Q92736,Ryanodine receptor 2,CC - MISCELLANEOUS


32 apt install ca-certificates


Or, if we want the proteins related to carbon 
monoxide, we only need to replace the argument: 


$ ./getproteins.sh 17245 


And the output on our terminal should be an even 
longer list of proteins: 


Q58432,Phosphomethylpyrimidine synthase,CC - CATALYTIC ACTIVITY
Q62976,Calcium-activated potassium channel subunit alpha-1,CC - ENZYME REGULATION; CC - DOMAIN
Q63185,Eukaryotic translation initiation factor 2-alpha kinase 1,CC - ENZYME REGULATION


If we want to analyze all the lines we can redirect the output to the
command line tool less, which allows us to navigate through the
output by using the arrow keys. To do that we can add the bar
character (|) between two commands, which will transfer the output of
the first command as input of the second:


$ ./getproteins.sh 27732 | less


To exit from less just press q.

However, what we really want is to save the output as a file, not
just print some characters on the screen. Thus, what we should do is
redirect the output to a CSV file. This can be done by adding the
redirection operator > and the filename, as described previously:


$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv


We should note that curl still prints some progress information on
the terminal.


Standard Error Output


This happens because curl is displaying that information in the
standard error output, which was not redirected to the file³³. The >
character without any preceding number redirects the standard output
by default. The same happens if we precede it by the number 1. If we
do not want to see that information, we can also redirect the
standard error output (2), but in this case to the null device
(/dev/null):


$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv 2>/dev/null
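
The two output channels can be observed without any download. In the sketch below, report is a hypothetical stand-in for a tool such as curl: it writes one line of data to the standard output and one message to the standard error output. The file names are also made up for the illustration.

```shell
workdir=$(mktemp -d)

# report imitates a tool that sends data to the standard output (1)
# and progress messages to the standard error output (2)
report() {
  echo "Q15413,Ryanodine receptor 3,CC - MISCELLANEOUS"
  echo "progress: 100%" >&2
}

# Data goes to one file, messages to another, nothing to the terminal
report > "$workdir/data.csv" 2> "$workdir/messages.txt"

# Data goes to the file, messages are discarded
report > "$workdir/data.csv" 2> /dev/null

cat "$workdir/data.csv"
```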


We can also use the -s option of curl in order to suppress the
progress information, by adding it to our script file named
getproteins.sh:


1 curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt"


The equivalent long form of the -s option is --silent.

Now when executing the script, no progress information is shown:


$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt.csv


To check if the file was really created and to analyze its contents,
we can use the less command:


$ less chebi_27732_xrefs_UniProt.csv


We can also open the file in our spreadsheet application, such as
LibreOffice Calc or Microsoft Excel.

As an exercise, execute the script to get the CSV file with the
proteins associated with water³⁴ and gold³⁵.


33 https://www.gnu.org/software/bash/manual/html_node/


Data Extraction 


Some data in the CSV file may not be relevant to our information
need, i.e. we may need to identify and extract the relevant data. In
our case, we will first select the relevant proteins (lines) using
the command line tool grep, and secondly, we will select the column
we need using the command line tool gawk, which is the GNU
implementation of awk³⁶. We should note that if we are using
MobaXterm we may need to install the gawk package³⁷. We can also
replace gawk by awk in case another implementation is available.

Since our information need is about diseases related to caffeine, we
may assume that we are only interested in proteins that have one of
these topics in the third column:


CC - MISCELLANEOUS
CC - DISRUPTION PHENOTYPE
CC - DISEASE


Extracting lines from a text file is the main function of grep. The
selection is performed by giving as input a pattern that grep tries
to find in each line, presenting only the lines where it was able to
find a match. The pattern is the same as the one we normally use when
searching for a word in our text editor. The grep command also works
with more complex patterns such as regular expressions, which we will
describe later on.


34 https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:15377
35 https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:30050
36 http://www.gnu.org/software/gawk/
37 apt install gawk


Single and Multiple Patterns


We can execute the following command that selects the proteins with
the topic CC - MISCELLANEOUS, our pattern, in our CSV file:


$ grep 'CC - MISCELLANEOUS' chebi_27732_xrefs_UniProt.csv


The output will be a shorter list of proteins, all 
with CC - MISCELLANEOUS as topic: 


A2AGL3,Ryanodine receptor 3,CC - MISCELLANEOUS
B0LPN4,Ryanodine receptor
MISCELLANEOUS
E9PZQ0,Ryanodine receptor
MISCELLANEOUS
E9Q401,Ryanodine receptor
MISCELLANEOUS
F1LMY4,Ryanodine receptor
MISCELLANEOUS
P11716,Ryanodine receptor
MISCELLANEOUS
P21817,Ryanodine receptor 1,CC - DISEASE; CC - MISCELLANEOUS
P54867,Protein SLG1,CC - MISCELLANEOUS
Q9TS33,Ryanodine receptor 3,CC - MISCELLANEOUS
Q15413,Ryanodine receptor 3,CC - MISCELLANEOUS
Q92736,Ryanodine receptor 2,CC - MISCELLANEOUS


To use multiple patterns, we must precede each pattern with the -e
option:


$ grep -e 'CC - MISCELLANEOUS' -e 'CC - DISRUPTION PHENOTYPE' -e 'CC - DISEASE' chebi_27732_xrefs_UniProt.csv


The equivalent long form of the -e option is --regexp=PATTERN.

The output on our terminal should be a longer list of proteins:


Q9VSH2,Gustatory receptor for bitter taste 66a,CC - FUNCTION; CC - DISRUPTION PHENOTYPE
Q15413,Ryanodine receptor 3,CC - MISCELLANEOUS
Q92736,Ryanodine receptor 2,CC - MISCELLANEOUS


We should note that, as previously, we can add | less to check all of
them more carefully. The less command also gives the opportunity to
find lines based on a pattern. We only need to type / and then a
pattern.
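
To experiment with patterns without downloading anything, we can build a small file by hand. The five rows below are copied from the listings in this chapter; the file name sample.csv is made up, and the -c option of grep, which counts the matching lines instead of printing them, is an addition of this sketch.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# A five-line sample with rows quoted from the listings in this chapter
cat > sample.csv << 'EOF'
P21817,Ryanodine receptor 1,CC - DISEASE; CC - MISCELLANEOUS
Q15413,Ryanodine receptor 3,CC - MISCELLANEOUS
Q92375,Thioredoxin reductase,DE
Q92736,Ryanodine receptor 2,CC - MISCELLANEOUS
Q9VSH2,Gustatory receptor for bitter taste 66a,CC - FUNCTION; CC - DISRUPTION PHENOTYPE
EOF

# One pattern: only the lines containing it are shown
grep 'CC - MISCELLANEOUS' sample.csv

# Several patterns: a line is kept if it matches at least one of them
grep -e 'CC - MISCELLANEOUS' -e 'CC - DISRUPTION PHENOTYPE' -e 'CC - DISEASE' sample.csv

# -c counts the matching lines instead of printing them
one=$(grep -c 'CC - MISCELLANEOUS' sample.csv)
several=$(grep -c -e 'CC - MISCELLANEOUS' -e 'CC - DISRUPTION PHENOTYPE' -e 'CC - DISEASE' sample.csv)
echo "one pattern: $one lines, several patterns: $several lines"
```

A line that matches more than one pattern, such as the one for P21817, is still shown and counted only once.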

We can now update our script file named getproteins.sh to contain the
following lines:


1 curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt" | \
2 grep -e 'CC - MISCELLANEOUS' -e 'CC - DISRUPTION PHENOTYPE' -e 'CC - DISEASE'


We should note that we added the -s option to suppress the progress
information of curl, and the characters | \ to the end of the line to
redirect the output of that line as input of the next line, in this
case the grep command. We need to be careful in ensuring that \ is
the last character in the line, i.e. spaces at the end of the line
may cause problems.

We can now execute the script again:


$ ./getproteins.sh 27732


The output should be similar to what we got previously, but the
script now downloads the data and filters it immediately.

To save the file with the relevant proteins, we only need to add the
redirection operator:


$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt_relevant.csv


Data Elements Selection


Now we need to select just the first column, 
the one that contains the protein identifiers. Se- 
lecting columns from a tabular file is one easy 
task for gawk, that besides performing pattern 
scanning also provides a complex processing lan- 
guage (AWK?). This processing language can 
be highly complex* and it is out of our scope 
for this introductory manuscript. The gawk com- 
mand can receive as arguments the character that 
divides each data element (column) in a line 
using the -F option, and an instruction of what 
to do with it enclosed by single quotes and curly 
brackets. The equivalent long form to the -F 
option is --field-separator=fs. 

For example, we can get the first column of 
our CSV file: 


s gawk -F, '{ print $1 }' < 
chebi 27732 xrefs UniProt 
relevant.csv 


We should note that the comma (,) is the character that separates data
elements in a CSV file, that print is equivalent to echo, and that $1
represents the first data element.

The command will display only the first column of the file, i.e. the
protein identifiers:


Q9VSH2
Q15413
Q92736
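
The column selection can also be tried on a small inline sample. This is only a sketch: the two rows below are made up (their second column is a placeholder), and the AWK variable is an assumption added so that the example still runs where gawk is not installed:

```shell
# Use gawk when available, otherwise fall back to any awk.
AWK=$(command -v gawk || command -v awk)

# Two made-up rows in a comma-separated layout; -F, splits each line at
# the commas and $1 is the first data element.
ids=$(printf 'Q15413,RYR3,CC - MISCELLANEOUS\nQ92736,RYR2,CC - MISCELLANEOUS\n' | \
  "$AWK" -F, '{ print $1 }')
echo "$ids"
```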


For example, we can get the first and third 
columns separated by a comma: 


$ gawk -F, '{ print $1 ", " $3 }' < chebi_27732_xrefs_UniProt_relevant.csv


Now, the output contains both the first and 
third column of the file: 


Q9VSH2, CC - FUNCTION; CC - DISRUPTION PHENOTYPE
Q15413, CC - MISCELLANEOUS
Q92736, CC - MISCELLANEOUS


[39] https://en.wikipedia.org/wiki/AWK


We can update our script file named getproteins.sh to contain the
following lines:


curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$1&dbName=UniProt" | \
grep -e 'CC - MISCELLANEOUS' -e 'CC - DISRUPTION PHENOTYPE' -e 'CC - DISEASE' | \
gawk -F, '{ print $1 }'


Only the last line is new, apart from the | \ added to the end of the
previous line to redirect its output.
To execute the script, we can type again:


$ ./getproteins.sh 27732


The output should be similar to what we got previously, but now only
the protein identifiers are displayed.

To save the output as a file with the relevant proteins' identifiers,
we only need to add the redirection operator:


$ ./getproteins.sh 27732 > chebi_27732_xrefs_UniProt_relevant_identifiers.csv


Task Repetition 


Given a protein identifier, we can construct the URL that will enable
us to download its information from UniProt. We can use the RESTful web
services provided by UniProt [41], more specifically the one that
allows us to retrieve a specific entry [42]. The construction of the
URL is simple: it always starts with https://www.uniprot.org/uniprot/,
followed by the protein identifier, and ends with a dot and the data
format. For example, the link for protein P21817 using the XML format
is: https://www.uniprot.org/uniprot/P21817.xml


[41] https://www.uniprot.org/help/api
[42] https://www.uniprot.org/help/api_retrieve_entries


Assembly Line 


However, we need to construct one URL for each protein from the list we
previously retrieved. The size of the list can be large (hundreds of
proteins), varies for different compounds, and evolves with time. Thus,
we need an assembly line in which a list of protein identifiers,
independently of its size, is given as input to commands that construct
one URL for each protein and retrieve the respective file.

The xargs command line tool works as an assembly line: it executes a
command for each line given as input. We should note that if we are
using MobaXterm we may need to install the findutils package [43],
since the default xargs only has minimal options [44].

We can start experimenting with the xargs command by giving it as input
the list of protein identifiers in the file
chebi_27732_xrefs_UniProt_relevant_identifiers.csv, and displaying each
identifier on the screen in the middle of a text message by providing
the echo command as argument:


$ cat chebi_27732_xrefs_UniProt_relevant_identifiers.csv | xargs -I {} echo 'Another protein id {} to retrieve'


The xargs command received as input the contents of our CSV file, and
for each line displayed a message including the identifier in that
line. The -I option tells xargs to replace {} in the command line given
as argument by the value of the line being processed. The equivalent
long form to the -I option is --replace=R.

The output should be something like this:


Another protein id A2AGL3 to retrieve
Another protein id B0LPN4 to retrieve
Another protein id E9PZQ0 to retrieve


[43] apt install findutils
[44] In some versions the scripts may have to use xargs.exe to invoke
the new version. Alternatively, rename the xargs shortcut in the bin
folder to another name, so that the right version is always invoked.


Instead of creating inconsequential text messages, we can use xargs to
create the URLs:


$ cat chebi_27732_xrefs_UniProt_relevant_identifiers.csv | xargs -I {} echo 'https://www.uniprot.org/uniprot/{}.xml'


The output should be something like this:


https://www.uniprot.org/uniprot/A2AGL3.xml
https://www.uniprot.org/uniprot/B0LPN4.xml
https://www.uniprot.org/uniprot/E9PZQ0.xml


We can try to use these links in our internet 
browser to check if those displayed URLs are 
working correctly. 

Now that we have the URLs, we can automatically download the files
using the curl command instead of echo:


$ cat chebi_27732_xrefs_UniProt_relevant_identifiers.csv | xargs -I {} curl 'https://www.uniprot.org/uniprot/{}.xml' -o 'chebi_27732_{}.xml'


We should note that we now use the -o option to save the output to a
given file, named after each protein identifier. The equivalent long
form to the -o option is --output <file>.
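
Before downloading a long list of entries, one cautious pattern is a dry run that only prints the commands xargs would execute. The sketch below is an illustration rather than part of the original script; it uses the three identifiers from the example output instead of the CSV file:

```shell
# Placing echo in front of curl makes xargs display each command
# instead of executing it, so nothing is downloaded.
commands=$(printf 'A2AGL3\nB0LPN4\nE9PZQ0\n' | \
  xargs -I {} echo curl 'https://www.uniprot.org/uniprot/{}.xml' -o 'chebi_27732_{}.xml')
echo "$commands"
```

Once the displayed commands look right, removing the echo turns the dry run into the real download.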

To check if everything worked as expected, we can use the ls command to
view which files were created:


$ ls chebi_27732_*.xml


The asterisk character (*) is used here to represent any file whose
name starts with chebi_27732_ and ends with .xml.

To check the contents of any of them, we can use the less command:


$ less chebi_27732_P21817.xml


File Header


We should note that the content of every file has to start with <?xml;
otherwise there was a download error, and we have to run curl again for
those entries. To check the header of each file, we can use the head
command together with less:


$ head -n 1 chebi_27732_*.xml | less


The -n option specifies how many lines to print, 
in the previous command just one. 
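
When there are many files, reading every header by eye becomes tedious. As a hedged alternative, the sketch below lists the files that do not contain <?xml at all, using the -L option (--files-without-match) of grep, which GNU and BSD grep support but which is not part of the POSIX standard. The two toy files and their contents are made up, and the check looks at the whole file rather than only at the first line:

```shell
# Work in a temporary folder with two toy files.
tmp=$(mktemp -d)
printf '<?xml version="1.0" encoding="UTF-8"?>\n<uniprot>\n</uniprot>\n' > "$tmp/chebi_27732_P21817.xml"
printf '<html>Service unavailable</html>\n' > "$tmp/chebi_27732_Q15413.xml"

# -L prints the names of the files that do NOT contain the pattern.
failed=$(cd "$tmp" && grep -L '<?xml' chebi_27732_*.xml)
echo "Files to download again: $failed"

rm -r "$tmp"
```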

If, for any reason, we are not able to download the files from UniProt,
we can get them from the book file archive [45].


Variable 


We can now update our script file named getproteins.sh to contain the
following lines:


ID=$1 # The CHEBI identifier given as input is renamed to ID
rm -f chebi\_$ID\_*.xml # Removes any previous files
curl -s "https://www.ebi.ac.uk/chebi/viewDbAutoXrefs.do?d-1169080-e=1&6578706f7274=1&chebiId=$ID&dbName=UniProt" | \
grep -e 'CC - MISCELLANEOUS' -e 'CC - DISRUPTION PHENOTYPE' -e 'CC - DISEASE' | \
gawk -F, '{ print $1 }' | xargs -I {} curl 'https://www.uniprot.org/uniprot/{}.xml' -o chebi\_$ID\_{}.xml


We should note that the last line now includes the xargs and curl
commands, and the $ID variable. This new variable is created in the
first line to contain the first value given as argument ($1). So, every
time we mention $ID in the script we are referring to the first value
given as argument. This avoids ambiguity in cases where $1 is used for
other purposes, like in the gawk command. Since the character that
follows $ID is an underscore (_), which is also valid in variable
names, we have to add a backslash (\) before it; otherwise the shell
would look for a variable named ID_. The second line uses the rm
command to remove any files that were downloaded in a previous
execution. We also added two comments after the hash character (#), so
we humans do not forget why these commands are needed.


[45] http://labs.rd.ciencias.ulisboa.pt/book/
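
The effect of the backslash can be checked in isolation. In this small sketch the filenames are only built, not created; the curly-bracket form ${ID} is a standard shell alternative that the script above does not use:

```shell
ID=27732
unset ID_P21817   # make sure this unrelated variable does not exist

# Without protection the shell reads the variable name as ID_P21817,
# which is not defined, so the identifier vanishes from the filename.
wrong=chebi_$ID_P21817.xml

# A backslash before the underscore ends the variable name, as in the script.
escaped=chebi\_$ID\_P21817.xml

# Curly brackets around the name are an equivalent alternative.
braces=chebi_${ID}_P21817.xml

echo "$wrong $escaped $braces"
```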
To execute the script once more: 


$ ./getproteins.sh 27732


And again, to check the results:


$ head -n 1 chebi_27732_*.xml | less


XML Processing 


Assuming that our information need only con- 
cerns human diseases, we have to process the 
XML file of each protein to check if it represents 
a Homo sapiens (Human) protein. 


Human Proteins 


For performing this filter, we can again use the 
grep command, to select only the lines of any 
XML file that specify the organism as Homo 
sapiens: 


$ grep '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml


We should get in our display the filenames that 
represent a human protein, i.e. something like 
this: 


chebi_27732_P21817.xml:<name type="scientific">Homo sapiens</name>
chebi_27732_Q15413.xml:<name type="scientific">Homo sapiens</name>
chebi_27732_Q8N490.xml:<name type="scientific">Homo sapiens</name>
chebi_27732_Q92736.xml:<name type="scientific">Homo sapiens</name>


We should note that since the asterisk character (*) provides multiple
files as argument to grep, namely the ones whose name starts with
chebi_27732_ and ends with .xml, the output now includes the filename
(followed by a colon) where each line was matched.

We could use the gawk command to extract only the filename, but grep
has the -l option to print just the filename:


$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml


The equivalent long form to the -l option is --files-with-matches.

The output will now show only the filenames:


chebi_27732_P21817.xml
chebi_27732_Q15413.xml
chebi_27732_Q8N490.xml
chebi_27732_Q92736.xml


These four files represent the four Human pro- 
teins related to caffeine. 
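
The same filter can be reproduced on toy data. This is a sketch under assumptions: the two files are created in a temporary folder, and their three-line contents are made up to mimic the organism element rather than copied from UniProt:

```shell
tmp=$(mktemp -d)
printf '<organism>\n<name type="scientific">Homo sapiens</name>\n</organism>\n' > "$tmp/chebi_27732_P21817.xml"
printf '<organism>\n<name type="scientific">Mus musculus</name>\n</organism>\n' > "$tmp/chebi_27732_A2AGL3.xml"

# -l prints only the names of the files with at least one matching line.
human=$(cd "$tmp" && grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml)
echo "$human"

rm -r "$tmp"
```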


PubMed Identifiers 


Now we need to extract the PubMed identifiers from these files to
retrieve the related publications. For example, we can execute the
following command:


$ grep '<dbReference type="PubMed"' chebi_27732_P21817.xml


The output is a long list of publications related to protein P21817:


<dbReference type="PubMed" id="2298749"/>
<dbReference type="PubMed" id="1354642"/>
<dbReference type="PubMed" id="8220422"/>
<dbReference type="PubMed" id="8661021"/>
<dbReference type="PubMed" id="15057824"/>


To extract just the identifier, we can again use 
the gawk command: 


$ grep '<dbReference type="PubMed"' chebi_27732_P21817.xml | gawk -F\" '{ print $4 }'


We should note that " is used as the separa- 
tion character and, since the PubMed identifier 
appears after the third ", the $4 represents the 
identifier. 

Now the output should be something like this: 


2298749 
1354642 
8220422 
8661021 
15057824 
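
To see why the identifier ends up in $4, the splitting can be tried on a single line. A minimal sketch, where the AWK variable is an assumption added so that the example also runs where only awk is installed:

```shell
AWK=$(command -v gawk || command -v awk)
line='<dbReference type="PubMed" id="2298749"/>'

# Splitting on the double quote gives:
#   $1 = <dbReference type=    $2 = PubMed
#   $3 =  id=                  $4 = 2298749
pmid=$(printf '%s\n' "$line" | "$AWK" -F\" '{ print $4 }')
echo "$pmid"
```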


PubMed Identifiers Extraction 


Now, to apply it to every protein, we may again use the xargs command:


$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | xargs -I {} grep '<dbReference type="PubMed"' {} | gawk -F\" '{ print $4 }'


This may provide a long list of PubMed identifiers, including
repetitions, since the same publication can be cited in different
entries.


Duplicate Removal


To help us identify the repetitions, we can add the sort command (man
sort or sort --help for more information), which will display the
repeated identifiers in consecutive lines (as a result of sorting all
identifiers):


$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | xargs -I {} grep '<dbReference type="PubMed"' {} | gawk -F\" '{ print $4 }' | sort | less


For example, we should easily be able to spot some repeated PubMed
identifiers:


10051009 
10051009 
10097181 
10097181 
10484775 
10484775 


Fortunately, the sort command also has the -u option, which removes all
these duplicates:


$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | xargs -I {} grep '<dbReference type="PubMed"' {} | gawk -F\" '{ print $4 }' | sort -u


To easily check how many duplicates were removed, we can use the word
count wc command with and without the usage of the -u option:


$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | xargs -I {} grep '<dbReference type="PubMed"' {} | gawk -F\" '{ print $4 }' | sort | wc
$ grep -l '<name type="scientific">Homo sapiens</name>' chebi_27732_*.xml | xargs -I {} grep '<dbReference type="PubMed"' {} | gawk -F\" '{ print $4 }' | sort -u | wc


In case we have any auxiliary file in our folder, such as
chebi_27732_P21817_entry.xml, we should add the option
--exclude '*entry.xml' to the first grep command.

The output should be something like: 


255 255 2243 
129 129 1136 


wc prints the number of lines, words, and bytes, thus in our case we
are interested in the first number (man wc or wc --help for more
information). We can see that we have removed 255 - 129 = 126
duplicates.

Just out of curiosity, we can also use the shell to perform simple
mathematical calculations using the expr command:


$ expr 255 - 129 
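
The whole counting procedure can be rehearsed on a short inline list. In this sketch the six identifiers are sample values, and the tr -d ' ' step is an addition that strips the padding some versions of wc place before the number:

```shell
ids='10051009
10051009
10097181
10097181
10484775
8220422'

# Count the lines before and after removing duplicates.
total=$(printf '%s\n' "$ids" | sort | wc -l | tr -d ' ')
unique=$(printf '%s\n' "$ids" | sort -u | wc -l | tr -d ' ')

# expr performs the subtraction, as in the command above.
removed=$(expr "$total" - "$unique")
echo "$total identifiers, $unique unique, $removed duplicates removed"
```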


Now let us create a script file named getpublications.sh by using a
text editor to add the following lines:


ID=$1 # The CHEBI identifier given as input is renamed to ID
grep -l '<name type="scientific">Homo sapiens</name>' chebi\_$ID\_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F\" '{ print $4 }' | sort -u


Again, do not forget to save it in our working directory, and add the
right permissions with chmod as we did previously with the other
scripts.

To execute the script:


$ ./getpublications.sh 27732


We can verify how many unique publications were obtained by using the
-l option of wc, which provides only the number of lines:


$ ./getpublications.sh 27732 | wc -l


The output will be 129 as expected. 


Complex Elements 


XML elements are not always on a single line, as fortunately was the
case with the PubMed identifiers. In those cases, we may have to use
the xmllint command, a parser that is able to extract data through the
specification of an XPath query, instead of using a single-line pattern
as in grep.


XPath 


XPath (XML Path Language) is a powerful tool to extract information
from XML and HTML documents by following their hierarchical structure.
Check W3C for more about XPath syntax [46]. We should note that xmllint
may not be installed by default depending on our operating system, but
it should be very easy to install [47]. If we are using MobaXterm, then
we need to install the xmllint plugin [48].


Namespace Problems 


In the case of our protein XML files, we can see that their second line
defines a specific namespace [49] using the xmlns attribute:


<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">


[46] https://www.w3schools.com/xml/xpath_syntax.asp
[47] apt install libxml2-utils
[48] https://mobaxterm.mobatek.net/plugins.html
[49] https://www.w3schools.com/xml/xml_namespaces.asp


This complicates our XPath queries, since we need to explicitly specify
that we are using the local name for every element in an XPath query.
For example, to get the data in each reference element:


$ xmllint --xpath "//*[local-name()='reference']" chebi_27732_P21817.xml


We should note that // means any path in the 
XML file until reaching a reference element. 
The square brackets in XPath queries normally 
represent conditions that need to be verified. 


Only Local Names 


If we are only interested in using local names, there is a way to avoid
the usage of local-name() for every element in an XPath query. We can
identify the top-level element, in our case entry, and extract all the
data that it encloses using an XPath query. For example, we can create
the auxiliary file chebi_27732_P21817_entry.xml by adding the
redirection operator:


$ xmllint --xpath "//*[local-name()='entry']" chebi_27732_P21817.xml > chebi_27732_P21817_entry.xml


The new XML file now starts and ends with the entry element without any
namespace definition:


<entry dataset="Swiss-Prot" created="1991-05-01" modified="2018-09-12" version="211">
<accession>P21817</accession>
...
</sequence>
</entry>


Now we can apply any XPath query, for example //reference, on the
auxiliary file without the need to explicitly say that it represents a
local name:


$ xmllint --xpath '//reference' chebi_27732_P21817_entry.xml


The output should contain only the data inside of each reference
element:


<reference key="1">
<citation type="journal article" date="1990" name="J. Biol. Chem." volume="265" first="2244" last="2256">
<title>Molecular cloning of cDNA encoding human and rabbit forms of the Ca2+ release channel (ryanodine receptor) of skeletal muscle sarcoplasmic reticulum.</title>
...
<dbReference type="DOI" id="10.1111/cge.12810"/>
</citation>
<scope>VARIANTS CCD PRO-2963 AND ASP-4806</scope>
</reference>


Queries


The XPath syntax allows us to create many useful queries, such as:

* //dbReference - elements of type dbReference that are descendants of
  something; Result:

  <dbReference type="NCBI Taxonomy" id="9606"/>
  ...
  <dbReference type="PubMed" id="27586648"/>

* /entry//dbReference - equivalent to the previous query but specifying
  that the dbReference elements are descendants of the entry element;

* /entry/reference/citation/dbReference - equivalent to the previous
  query but specifying the full path in the XML file;

* //dbReference/* - any child elements of a dbReference element;
  Result:

  <property type="protein sequence ID" value="AAA60294.1"/> ... <property type="match status" value="5"/>

* //dbReference/property[1] - first property element of each
  dbReference element; Result:

  <property type="protein sequence ID" value="AAA60294.1"/> ... <property type="entry name" value="MIR"/>

* //dbReference/property[2] - second property element of each
  dbReference element; Result:

  <property type="molecule type" value="mRNA"/>
  <property type="match status" value="5"/>

* //dbReference/property[3] - third property element of each
  dbReference element; Result:

  <property type="molecule type" value="Genomic DNA"/>
  <property type="project" value="UniProtKB"/>

* //dbReference/property/@type - all type attributes of the property
  elements; Result:

  type="protein sequence ID" type="molecule type" type="protein sequence ID" type="entry name" type="match status"

* //dbReference/property[@type="protein sequence ID"] - the previous
  property elements that have an attribute type equal to protein
  sequence ID; Result:

  <property type="protein sequence ID" value="AAA60294.1"/> <property type="protein sequence ID" value="ENSP00000352608"/>

* //dbReference/property[@type="protein sequence ID"]/@value - the
  string assigned to each attribute value of the previous property
  elements; Result:

  value="AAA60294.1" value="AAC51191.1" value="ENSP00000352608"

* //sequence/text() - the contents inside the sequence elements;
  Result:

  MGDAEGEDEVQFLRTDDEVVLQCSATVLKEQLKLCLAAEGFGNRLCFLEPTSNAQNVPPD
  ...
  LEEHNLANYMFFLMYLINKDETEHTGQESYVWKMYQERCWDFFPAGDCFRKQYEDQLS

We should note that to try the previous queries we only need to replace
the string after the --xpath option of the previous xmllint command,
such as:


$ xmllint --xpath '//dbReference' chebi_27732_P21817_entry.xml


Thus, an alternative way to extract the PubMed identifiers using
xmllint instead of grep would be something like this:


$ xmllint --xpath '//dbReference[@type="PubMed"]/@id' chebi_27732_P21817_entry.xml


However, the output contains all identifiers in the same line and with
the id label:


id="2298749" id="1354642" id="8220422"


Extracting XPath Results 


To extract the identifiers, we need to apply the tr command to split
the output in multiple lines (one line per identifier), and then the
gawk command:


$ xmllint --xpath '//dbReference[@type="PubMed"]/@id' chebi_27732_P21817_entry.xml | tr ' ' '\n' | gawk -F\" 'NF > 0 { print $2 }'


The tr command replaces each space by a newline character, and the gawk
command extracts the value inside the double quotes. We should note
that the condition NF > 0, placed before the curly brackets, is used to
select only lines with at least one data element, i.e. in our case it
ignores empty lines.
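
The two steps can be checked without xmllint by feeding them a string shaped like its output. This is a sketch under assumptions: the xpath_output value imitates the single-line result shown earlier (including its leading space), and the AWK variable is only a fallback for systems without gawk:

```shell
AWK=$(command -v gawk || command -v awk)
xpath_output=' id="2298749" id="1354642" id="8220422"'

# tr turns every space into a newline, which leaves an empty first line;
# the condition NF > 0 skips it, and $2 is the text between the quotes.
ids=$(printf '%s\n' "$xpath_output" | tr ' ' '\n' | "$AWK" -F\" 'NF > 0 { print $2 }')
echo "$ids"
```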


Text Retrieval 


Now that we have all the PubMed identifiers, we 
need to download the text included in the titles 
and abstracts of each publication. 


Publication URL 


To retrieve the publication entry of a given identifier from the
UniProt citations service, we can again use the curl command and a link
to the publication entry. For example, if we click on the Format button
of the UniProt citations service entry [50], we can get the link to the
RDF/XML version. RDF [51] is a standard data model that can be
serialized in an XML format. Thus, in our case, we can deal with this
format like we did with XML.

We can retrieve the publication entry by exe- 
cuting the following command: 


$ curl https://www.uniprot.org/citations/1354642.rdf


Thus, we can now update the script getpublications.sh to have the
following commands:


ID=$1 # The CHEBI identifier given as input is renamed to ID
rm -f chebi\_$ID\_*.rdf # Removes any previous files
grep -l '<name type="scientific">Homo sapiens</name>' chebi\_$ID\_*.xml | \
xargs -I {} grep '<dbReference type="PubMed"' {} | \
gawk -F\" '{ print $4 }' | sort -u | \
xargs -I {} curl 'https://www.uniprot.org/citations/{}.rdf' -o chebi\_$ID\_{}.rdf


[50] https://www.uniprot.org/citations/1354642
[51] https://www.w3.org/RDF/


We should note that only the second and last lines 
were updated to remove and retrieve the files, 
respectively. 

Now let us execute the script: 


$ ./getpublications.sh 27732


It may take a while to download all the entries, 
but probably no more than one minute with a 
standard internet connection. 

To check if everything worked as expected, we can use the ls command to
view which files were created:


$ ls chebi_27732_*.rdf


If, for any reason, we are not able to download the abstracts from
UniProt, we can get them from the book file archive [52].


Title and Abstract 


Each file has the title and abstract of the publication as values of
the title and rdfs:comment elements, respectively. To extract them we
can again use the grep command:


$ grep -e '<title>' -e '<rdfs:comment>' chebi_27732_1354642.rdf


The output should be something like these two lines:


<title>Polymorphisms ... hyperthermia.</title>
<rdfs:comment>Twenty-one ... gene.</rdfs:comment>


[52] http://labs.rd.ciencias.ulisboa.pt/book/


To remove the XML elements, we can again use gawk:


$ grep -e '<title>' -e '<rdfs:comment>' chebi_27732_1354642.rdf | gawk -F'[<>]' '{ print $3 }'


We should note that we now use two characters as field separators, <
and >, to get the text between the first > and the second <. The first
field separator is < so $2 contains the string title or rdfs:comment
while $1 is empty. The second field separator is > so $3 contains the
string we want to keep.
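
The field numbering with two separators can be verified on a single line. A minimal sketch, in which the title text is made up and the AWK variable is only a fallback for systems without gawk:

```shell
AWK=$(command -v gawk || command -v awk)
line='<title>A made-up publication title.</title>'

# Splitting on < or > gives:
#   $1 = (empty)   $2 = title
#   $3 = the text  $4 = /title
tag=$(printf '%s\n' "$line" | "$AWK" -F'[<>]' '{ print $2 }')
text=$(printf '%s\n' "$line" | "$AWK" -F'[<>]' '{ print $3 }')
echo "$tag: $text"
```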
The output should now be free of XML elements:


Polymorphisms ... hyperthermia.
Twenty-one ... gene.


Thus, let us create the script gettext.sh to have the following
commands:


ID=$1 # The CHEBI identifier given as input is renamed to ID
grep -e '<title>' -e '<rdfs:comment>' chebi\_$ID\_*.rdf | \
gawk -F'[<>]' '{ print $3 }'


Again do not forget to save it in our working 
directory, and add the right permissions. 

Now to execute the script and see the retrieved 
text: 


$ ./gettext.sh 27732 | less


We can save the resulting text in a file named 
chebi_27732.txt that we may share or read using 
our favorite text editor, by adding the redirection 
operator: 


$ ./gettext.sh 27732 > chebi_27732.txt


Disease Recognition 


Instead of reading all that text to find any disease related to
caffeine, we can try to find sentences about a given disease by using
grep:


$ grep 'malignant hyperthermia' chebi_27732.txt


To save the filtered text in a file named chebi_27732_hyperthermia.txt,
we only need to add the redirection operator:


$ grep 'malignant hyperthermia' chebi_27732.txt > chebi_27732_hyperthermia.txt


This is a very simple way of recognizing a disease in text. The next
chapters will describe how to perform more complex text processing
tasks.


Further Reading 


If we really want to become an expert in shell scripting we may be
interested in reading a book specialized in the subject, such as the
book entitled The Linux command line: a complete introduction (Shotts
Jr 2012).

A more pragmatic approach is to explore the vast number of online
tutorials about shell scripting and web technologies, such as the ones
provided by W3Schools [53].


[53] https://www.w3schools.com/


Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License 
(http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in 
any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to 
the Creative Commons licence and indicate if changes were made. 

The images or other third party material in this chapter are included in the chapter’s Creative Commons licence, 
unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons 
licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to 
obtain permission directly from the copyright holder. 


Abstract 

In the previous chapter we were able to 
automatically process structured data to 
retrieve biomedical text about any chemical 
compound, such as caffeine. This chapter will 
provide a step-by-step introduction to how 
we can process that text using shell script 
commands, specifically extract information 
about diseases related to caffeine. The goal 
is to equip the reader with an essential set of 
skills to extract meaningful information from 
any text. 


Keywords 

NLP: Natural Language Processing - Text 
mining - Pattern matching - String matching - 
Word matching - Evaluation metrics - 
Regular expressions - Tokenization - NER: 
Named-Entity Recognition - Relation 
extraction 


In the previous chapter we were able to automatically process
structured data to retrieve biomedical text about any chemical
compound, such as caffeine. This chapter will provide a step-by-step
introduction to how we can process that text using shell script
commands, specifically extract information about diseases related to
caffeine. The goal is to equip the reader with an essential set of
skills to extract meaningful information from any text.


Pattern Matching


We used the grep command in the last chapter to find a disease in the
text, since grep receives as argument a pattern and finds its exact
matches in the text, like the search functionality provided by
conventional text editors. However, we may need to search for multiple
patterns even when interested in a single disease. For example, when
searching for mentions of malignant hyperthermia, we may also be
interested in finding mentions using related expressions, such as:


MH - acronym
MHS - acronym for malignant hyperthermia susceptible


Since we already know how to deal with multiple patterns by using the
-e option, we may easily solve this problem by executing:


$ grep -e 'malignant hyperthermia' -e 'MH' -e 'MHS' chebi_27732.txt


Case Insensitive Matching 


When dealing with text, using a case sensitive search is usually a good
approach to avoid wrong matches. For example, acronyms are normally in
uppercase, while the full name is usually in lowercase, sometimes with
the first letter of each word (or only of the first word) in uppercase.
So, instead of using a fully case sensitive grep, we might think of
performing a case sensitive grep for the acronyms and a case
insensitive grep for the disease words using the -i option:


$ grep -e 'MH' -e 'MHS' chebi_27732.txt
$ grep -i -e 'malignant hyperthermia' chebi_27732.txt


The equivalent long form to the -i option is --ignore-case. We should
note that the two executions of grep will produce two separate lists of
matching lines that might overlap.

Alternatively, we can also convert it to just one case sensitive grep,
if we are sure that Malignant hyperthermia is the only alternative case
to malignant hyperthermia present in the text. So, we can add it as
another pattern:


$ grep -e 'Malignant hyperthermia' -e 'malignant hyperthermia' -e 'MH' -e 'MHS' chebi_27732.txt


Number of Matches


To be sure that we are not losing any match, we can count the number of
matching lines for both cases. First we execute a case insensitive grep
and then a case sensitive grep, both using the -c option:


$ grep -c -i 'malignant hyperthermia' chebi_27732.txt
$ grep -c -e 'malignant hyperthermia' -e 'Malignant hyperthermia' chebi_27732.txt


The equivalent long form to the -c option is --count.

In our case, the output should show 96 and 95 matching lines for the
insensitive and sensitive patterns, respectively.

This means that there is a line that is not caught by the case
sensitive patterns. To identify which one it is, we could manually
analyze each of the 96 matching lines one by one. But the goal of this
book is exactly to avoid this type of tedious task. One thing we can do
to solve this issue is to find, among the case insensitive matches, the
ones that do not match the case sensitive patterns.


Invert Match 


Fortunately, the grep command has the -v option that inverts the
matching and returns the lines of text that do not contain any match.
The equivalent long form to the -v option is --invert-match.

Thus, if we apply the inverted match with the case sensitive patterns
to the output given by the case insensitive matching, we will get our
outlier mention:


$ grep -i 'malignant hyperthermia' chebi_27732.txt | grep -v -e 'Malignant hyperthermia' -e 'malignant hyperthermia'


From the output, we can easily identify the 
missing matching line: 


...gene are associated with 
Malignant Hyperthermia (MH) 
and... 


We were missing the case where both words have 
the first letter in uppercase. 
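
The counting and the inverted match can be rehearsed on a few lines of text. This is only a sketch: the three sentences below are made up, so the counts are 3 and 2 instead of the 96 and 95 obtained from the real file:

```shell
text='Mutations cause malignant hyperthermia in pigs.
Malignant hyperthermia is a pharmacogenetic disorder.
The gene is associated with Malignant Hyperthermia (MH).'

# Count the matching lines with and without case sensitivity.
insensitive=$(printf '%s\n' "$text" | grep -c -i 'malignant hyperthermia')
sensitive=$(printf '%s\n' "$text" | grep -c -e 'malignant hyperthermia' -e 'Malignant hyperthermia')

# Keep the case insensitive matches that the sensitive patterns miss.
outlier=$(printf '%s\n' "$text" | grep -i 'malignant hyperthermia' | \
  grep -v -e 'malignant hyperthermia' -e 'Malignant hyperthermia')
echo "$insensitive $sensitive"
echo "$outlier"
```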

Thus, to obtain all the matching lines in a case sensitive match we
just have to include the missing match as another pattern:


$ grep -c -e 'malignant hyperthermia' -e 'Malignant hyperthermia' -e 'Malignant Hyperthermia' chebi_27732.txt


File Differences 


Another alternative to compare different matches is to use the diff
command, which receives as input two files and identifies their
differences. So, we can create two auxiliary files and then apply diff
to them:


$ grep -i 'malignant hyperthermia' chebi_27732.txt > insensitive.txt
$ grep -e 'Malignant hyperthermia' -e 'malignant hyperthermia' chebi_27732.txt > sensitive.txt
$ diff sensitive.txt insensitive.txt


The output should show the same outlier line that we found before.

A problem that may occur with case sensitive matching is that some
acronyms are defined with lowercase letters in the middle, such as
ChEBI, and humans are not consistent in the way they mention them. The
same acronym may be mentioned in its original form, with all letters in
uppercase, or with just some of them. Moreover, these inconsistent
mentions may sometimes even be found in the same publication. We hope
not in this book!


Evaluation Metrics 


These inconsistencies made by humans when mentioning case sensitive
expressions are one of the reasons why most online search engines use
case insensitive searches by default. This type of approach favors
recall, while case sensitive searches favor precision [1].

Recall is the proportion of the number of 
correct matches found by our tool over the total 
number of correct mentions in the texts (found 
or not found). Case insensitive searches avoid 
missing mentions, so they favor recall. 

Precision is the proportion of the number of 
correct matches found by our tool over the total 
number of matches found (correct or incorrect). 
Case sensitive searches avoid incorrect matches, 
so they favor precision. 


¹ https://en.wikipedia.org/wiki/Precision_and_recall

Normally, there is a trade-off between precision
and recall. Using a technique that improves
precision will, most of the time, decrease
recall, and vice versa. To know how good the
trade-off is, we can use the F-measure², which
is the harmonic mean of precision and
recall.
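
To make these definitions concrete, the three metrics can be computed from the number of correct matches (TP), incorrect matches (FP) and missed mentions (FN). This is only a sketch: the function name metrics and the counts are hypothetical, and it assumes that at least one correct match exists, so no division by zero occurs:

```shell
# metrics TP FP FN: prints precision, recall and F-measure.
# The counts used below are hypothetical; in practice they come from
# comparing the matches of a tool against a manually curated list.
metrics() {
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    precision = tp / (tp + fp)   # correct matches over all matches found
    recall    = tp / (tp + fn)   # correct matches over all correct mentions
    f = 2 * precision * recall / (precision + recall)   # harmonic mean
    printf "precision=%.2f recall=%.2f f-measure=%.2f\n", precision, recall, f
  }'
}

metrics 90 10 30
```

With 90 correct matches, 10 incorrect matches and 30 missed mentions, the precision is 0.90, the recall is 0.75, and the F-measure is about 0.82.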


Word Matching 


Acronyms (or terms) may also appear inside 
common words or longer acronyms. For 
example, when searching for MH, the word 
victimhood will produce a match: 


$ echo "victimhood" | grep -i 'MH'


The problem with victimhood could be easily 
solved by using case sensitive matching, but not 
for a longer acronym. For example, the acronym 
NEDMHM for neurodevelopmental disorder 
with midbrain and hindbrain malformations will 
produce a case sensitive match: 


$ echo "NEDMHM" | grep 'MH'


One way to address this problem is to use the 
-w option of grep to only match entire words, 
i.e. the match must be preceded and followed 
by characters that are not letters, digits, or an 
underscore (or be at the beginning or end of the 
line). The equivalent long form to the -w option 
is --word-regexp. 

Using this option, neither victimhood nor
NEDMHM will produce a match:


$ echo "victimhood" | grep -w -i 'MH'
$ echo "NEDMHM" | grep -w -i 'MH'


Word matching improves precision but de- 
creases recall, since we may miss some less 
common acronyms that we are not aware of, 
but are still relevant for our study. For example, 
consider that we may also be interested in the 
following acronyms: 


MHE - acronym for malignant hyperthermia
equivocal

MHN - acronym for malignant hyperthermia
normal

² https://en.wikipedia.org/wiki/F1_score


If we apply word matching, we will not get a 
match, since both exact matches are followed by 
a letter: 


$ echo "MHE and MHN" | grep -w -i 'MH'


These are not trivial problems to solve with exact
pattern matching; we may need regular expressions
to address some of these issues more
efficiently.
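
The trade-off of word matching can be observed by counting matching lines with and without the -w option. The sketch below assumes four invented lines: the first is a genuine mention, the next two are unwanted substring matches, and the last contains relevant acronyms that word matching misses:

```shell
sample=$(mktemp)
printf '%s\n' \
  'susceptible to MH (MHS)' \
  'a sense of victimhood' \
  'diagnosed with NEDMHM' \
  'classified as MHE or MHN' > "$sample"

# Without -w every line matches (higher recall, lower precision).
substring_lines=$(grep -c -i 'MH' "$sample")
# With -w only the line where MH is an entire word matches.
word_lines=$(grep -c -w -i 'MH' "$sample")
echo "substring: $substring_lines, whole word: $word_lines"
rm -f "$sample"
```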


Regular Expressions 


When dealing with natural language text, we may
need more flexibility than exact matching
provides. Regular expressions are an efficient
tool to extend exact matching with flexible
patterns that may find different matches. As an
example, we may be interested in finding all the
mentions of the acronyms MHS or MHN in a text.
For that purpose, regular expressions provide the
alternation operator, which solves this issue
easily by specifying multiple alternatives to
match in a specific part of the pattern, in this
case an S or an N as the last character.

Regular expressions can be better understood 
by clearly separating three distinct components: 


input — any string where we want to find 
something 

pattern — a string that specifies what we are 
looking for 

match — a fragment of the input (a substring) 
where the pattern can be found 


In our examples, the input is the text file
chebi_27732.txt, but it can also be the amino acid
sequences that we previously extracted from the
UniProt file entries. Until now, the pattern has
represented an exact string to look for, where
each match is an exact replica of the pattern
occurring at a given position of the input string.
When using regular expressions, the pattern
contains special characters, whose purpose is
not to directly match the input but instead to
carry a special meaning. These special characters
represent operators that specify which different
types of strings we want to find in the input,
for example, strings that start with MH and end
with an S or an N. By using regular expressions,
the matches are not replicas of the pattern; they
can be different strings as long as they satisfy the
specified pattern.


Extended Syntax 


The grep command allows us to include regular
expression operators in the input pattern. grep
understands two different versions of regular
expression syntax: basic and extended³. We will
use the extended syntax for two reasons: (i) the
basic syntax does not support relevant operators,
such as alternation; and (ii) to clearly
differentiate exact matching from regular
expression matching. Thus, instead of the -e
option previously used in the grep command,
we will start to use the -E option, which makes
the command interpret the pattern as an extended
regular expression. The equivalent long form of
the -E option is --extended-regexp. We
should note that this option does not affect the
matching when using a pattern without any regular
expression operator, such as MH. For example,
the following commands will produce the same
results:


$ echo -e 'MHS\nMHN' | grep -e 'MH'
$ echo -e 'MHS\nMHN' | grep -E 'MH'


Note that we use the -e option so that the echo
command interprets the \n characters as a newline.
Thus, the echo command outputs two lines
that are given as input to the grep command.
We should note that the grep command filters
lines.


³ https://www.regular-expressions.info/posix.html


Alternation


The first regular expression operator we will test 
is the alternation, which we introduced above. 
An alternation is represented by the bar character 
(|) that specifies a pattern where any match 
must include either the preceding or following 
characters. The preceding and following charac- 
ters can be enclosed within parentheses to better 
specify the scope of the alternation operator. For 
example, the pattern for finding strings that start 
with MH and end with S or an N can be written 
as: 


$ echo -e 'MHS\nMHN' | grep -E 'MH(S|N)'


Basic Syntax 

If we use the basic regular expression syntax 
no match will be found, since the alternation 
operator is not supported: 


$ echo -e 'MHS\nMHN' | grep -e 'MH(S|N)'


We will have a match only if the | and the 
parentheses are in the input string, since it is not 
interpreted as an operator: 


$ echo -e 'MH(S|N)' | grep -e 'MH(S|N)'


Scope 

To better understand the scope of an alternation, 
we can remove the parentheses from the pattern 
and add the -w option: 


$ echo -e 'MHS\nMHN' | grep -w -E 'MHS|N'


We only get the first line. This happens because
the alternation operator applies to all the
preceding characters, i.e. grep will search
for the MHS word or the N word. If we replace MHN
with a single N in the input string, we get another
match:


$ echo -e 'MHS\nN' | grep -w -E 'MHS|N'


We can also move the opening parenthesis one
character to the left:


$ echo -e 'MHS\nMHN' | grep -E 'M(HS|N)'


Only MHS is now displayed, since the second
alternative now represents MN, without the H.
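
The three scopes can be compared side by side. In this sketch, the helper name show and the four input words are assumptions made for illustration; each call prints a pattern followed by the input words it matches as entire words:

```shell
# show PATTERN: print the pattern and the input words it matches.
show() {
  matched=$(printf 'MHS\nMHN\nN\nMN\n' | grep -w -E "$1" | paste -s -d ' ' -)
  echo "$1 -> $matched"
}

show 'MH(S|N)'   # MH followed by S or N
show 'MHS|N'     # the word MHS or the word N
show 'M(HS|N)'   # M followed by HS or N
```

The first pattern should match MHS and MHN, the second MHS and N, and the third MHS and MN.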


Multiple Alternatives 

We are not limited to two alternatives; we can
have multiple | operators in a pattern. For example,
the following command will find any of the
three acronyms MHS, MHE or MHN:


$ echo -e 'MHS\nMHN\nMHE' | grep -E 'MH(S|N|E)'


We can now transform our previous grep 
command with multiple case sensitive patterns: 


$ grep -c -e 'Malignant hyperthermia' -e 'Malignant Hyperthermia' \
  -e 'malignant hyperthermia' chebi_27732.txt


into a grep command with a single pattern using
alternation:


$ grep -c -E '(M|m)alignant (H|h)yperthermia' chebi_27732.txt


And we will obtain the same 96 matches. 


Multiple Characters 


A useful regular expression feature is that we 
can use the dot character (.) to represent any 
character, so if we want to find all the acronyms 
that start with MH we can execute the following 
command: 


$ grep -o -w -E 'MH.' chebi_27732.txt | sort -u


We should note that we use the -o option of the
grep command so that it displays only the matches
and not the whole line that includes the match. The
equivalent long form of the -o option is
--only-matching.

The output will be the following three-character
lines:


MH 
MH) 
MH, 
MH. 
MH1 
MH2 
MHE 
MHN 
MHS 


If we really want to match only the dot char- 
acter, we have to precede it with a backslash 
character (\): 


$ grep -o -w -E 'MH\.' chebi_27732.txt | sort -u


Now only the MH. will be displayed. 
We can check that there are some matches that 
are not really acronyms, such as MH) and MH,. 


Spaces 

We should note that MH followed by a space appears
in the output because the dot also matches the space
character. For example, the following text includes
a word match of MH followed by a space, since the
parenthesis that comes next is considered a word
delimiter character (not a letter, digit or
underscore):


susceptible to MH (MHS) 


On the other hand, the following text does not
include a word match of MH followed by a space:


markers and MH 
susceptibility 


Thus, what we really want are matches where the
third character is a letter or a numerical digit.
Sometimes, the text includes other characters
that also represent horizontal or vertical space in
typography, such as the tab character. All these
characters are known as whitespace and can be
represented by the expression \s in a pattern⁴.
The following command demonstrates that both
the space and the tab characters are matched by
\s:


$ echo -e 'space: :\ntab:\t:' | grep -E '\s'


⁴ https://en.wikipedia.org/wiki/Whitespace_character


Groups
Fortunately, regular expressions include the
group operator, which lets us easily specify a set
of characters. A group operator is represented by a
set of characters enclosed within square brackets.
Any of the enclosed characters can be matched.
For example, the previous command to find
any of the three acronyms can be replaced by:


$ echo -e 'MHS\nMHN\nMHE' | grep -E 'MH[SNE]'


We should note that only one of the three letters,
S, N or E, will be matched in the input string.


Ranges 

Still, this does not solve our need to match only
letters or digits. However, we can also specify
character ranges with the dash character (-). For
example, to find all the acronyms that start with
MH followed by any uppercase letter:


$ grep -o -w -E 'MH[A-Z]' 
chebi_27732.txt | sort -u 


This will result in only three acronyms: 


MHE 
MHN 
MHS 


We should note that A-Z represents any
uppercase letter; a lowercase letter will not
be matched:


$ echo -e 'MHS\nMHs' | grep -E 'MH[A-Z]'


If we intend to keep using a case sensitive
grep and, at the same time, find lowercase
matches, then we need to add the a-z range:


$ echo -e 'MHS\nMHs' | grep -E 'MH[A-Za-z]'


We should note that the dot character inside a
group represents itself and not any character:


$ echo -e 'MHS\nMH.' | grep -E 'MH[.]'


Additionally, to include the acronyms that end
with a numerical digit we need to add the 0-9
range:


$ grep -o -w -E 'MH[A-Z0-9]' chebi_27732.txt | sort -u


Finally, we have the correct list of all
three-character acronyms starting with MH:


MH1 
MH2 
MHE 
MHN 
MHS 


Negation 

Another frequent case is the need to match any
character with a few exceptions. For example, we
may need to find all the matches that start with MH
followed by any character except an uppercase
letter. Fortunately, we can use the negation feature
within a group operator. The negation feature is
represented by the circumflex character (^) right
after the left bracket. The negation means that
all the characters and ranges enclosed within the
brackets are the ones that cannot be matched.
Thus, a solution to the above example is to add
the A-Z range after the circumflex:


$ grep -o -w -E 'MH[^A-Z]' chebi_27732.txt | sort -u


We can see that all of the three acronyms 
MHS, MHE or MHN will be missing from the 
output: 


MH 

MH, 
MH. 
MH) 
MH1 
MH2 


If we do not want the match of MH followed by a
space, we can add the space character to the
negated group:


$ grep -o -w -E 'MH[^A-Z ]' chebi_27732.txt | sort -u


The output should now contain one match
fewer:


MH, 
MH. 
MH) 
MH1 
MH2


Above we were interested in finding acronyms
composed of exactly three characters. However,
we may need to find all acronyms that start with
MH independently of their length. This functionality
is also available in regular expressions using
the quantifier operators.


Optional 
The simplest quantifier is the optional operator 
that is specified by an item followed by the 
question mark character (?). The item can be a 
character, an operator or a sub-pattern enclosed 
by parentheses. That item becomes optional for 
matching, i.e. a match can either contain that 
item or not. 

For example, to find all the acronyms starting
with MH and followed by one uppercase letter or
digit, or by none:


$ grep -o -w -E 'MH[A-Z0-9]?' 
chebi_27732.txt | sort -u 


Given that the third character is optional, the
output will include the two-character acronym
MH, but not the match of MH followed by a space:


MH 

MH1 
MH2 
MHE 
MHN 
MHS 


We can add the space character to the
group:


$ grep -o -w -E 'MH[A-Z0-9 ]?' chebi_27732.txt | sort -u


Now the output includes both the two-character
acronym MH and the match of MH followed by a space:


MH 

MH 

MH1 
MH2 
MHE 
MHN 
MHS


Multiple and Optional

To find all the acronyms independently of their
length, we can use the asterisk character (*).
The preceding item becomes optional and can be
repeated multiple times. For example, to find all
the acronyms starting with MH that may
be followed by any number of uppercase letters or
numeric digits:


$ grep -o -w -E 'MH[A-Z0-9]*' chebi_27732.txt | sort -u


The output now includes the four-character
acronym MHS1:


MH 
MH1 
MH2 
MHE 
MHN 
MHS 
MHS1 


We should note that the grep command
uses a greedy approach, i.e. it will try to match
as many characters as possible. For example,
the following command will match MH1 and
not MH:


$ echo 'MH1' | grep -o -E 'MH[0-9]*'


Multiple and Compulsory 

To make the preceding item compulsory and able 
to repeat it multiple times, we may replace the 
asterisk by the plus character (+). For example, 
the following pattern will find all the acronyms 
starting with MH followed by at least one alpha- 
betic letter or numeric digit: 


$ grep -o -w -E 'MH[A-Z0-9]+' chebi_27732.txt | sort -u


We should note that the output does not
contain the two-character acronym MH:


MH1 
MH2 
MHE 
MHN 
MHS 
MHS1


All Options
The above quantifiers are the most popular, but 
the functionality of all of them can be reproduced 
by using curly braces to specify the minimal and 
maximum number of occurrences. The item is 
followed by an expression of the type {n,m} 
where n and m are to be replaced by a number 
specifying the minimum and maximum number 
of occurrences, respectively. n and m may also 
be omitted, which means that no minimum or 
maximum limit is to be imposed. 

Using curly brackets, the question mark char- 
acter (?) can be replaced by {0,1}. Thus, the 
following two patterns are equivalent: 


$ grep -o -w -E 'MH[A-Z0-9]?' chebi_27732.txt | sort -u
$ grep -o -w -E 'MH[A-Z0-9]{0,1}' chebi_27732.txt | sort -u


The asterisk character (*) can be replaced
by {0,}. Thus, the following two patterns are
equivalent:


$ grep -o -w -E 'MH[A-Z0-9]*' chebi_27732.txt | sort -u
$ grep -o -w -E 'MH[A-Z0-9]{0,}' chebi_27732.txt | sort -u


The plus character (+) can be replaced by
{1,}. Thus, the following two patterns are
equivalent:


$ grep -o -w -E 'MH[A-Z0-9]+' chebi_27732.txt | sort -u
$ grep -o -w -E 'MH[A-Z0-9]{1,}' chebi_27732.txt | sort -u


On the other hand, using {1,1} is the same as
not having any operator. Thus, the following two
patterns are equivalent:


$ grep -o -w -E 'MH[A-Z0-9]' chebi_27732.txt | sort -u
$ grep -o -w -E 'MH[A-Z0-9]{1,1}' chebi_27732.txt | sort -u


The previous commands display all the
three-character acronyms:




MH1 
MH2 
MHE 
MHN 
MHS 


For example, if we are looking for acronyms 
with exactly 4 characters then we can apply the 
following pattern: 


$ grep -o -w -E 'MH[A-Z0-9]{2,2}' chebi_27732.txt | sort -u


We should note that we use 2 as both the
minimum and maximum, since MH already counts as 2
characters.

The output of the previous command is now 
the four-character acronym: 


MHS1 
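
The quantifiers can be compared on a short list of acronyms. In this sketch, the list and the helper name match are assumptions made for illustration; each call prints the acronyms that are entirely matched by the given pattern:

```shell
acronyms='MH MH1 MHS MHS1 MHS12'

# match PATTERN: put one acronym per line, keep the whole-word matches
# and join them again with spaces.
match() {
  echo "$acronyms" | tr ' ' '\n' | grep -o -w -E "$1" | paste -s -d ' ' -
}

match 'MH[A-Z0-9]?'      # zero or one extra character
match 'MH[A-Z0-9]*'      # any number of extra characters
match 'MH[A-Z0-9]+'      # at least one extra character
match 'MH[A-Z0-9]{2,2}'  # exactly two extra characters
```

The same helper can be called with other {n,m} combinations to check how the minimum and maximum limits change the result.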


Position 


Sometimes, besides the match itself, we are also
interested in limiting the matches to specific parts
of the input string. For example, to identify start
and stop codons in a protein-coding sequence, we need
to limit the matches to the beginning or the end
of the sequence. In text, we may, for example,
be interested in lines starting with the name of
a disease. To take into account the position of a
match, regular expression patterns can start with
the circumflex character (^) and/or end with the
dollar sign character ($).

If the pattern starts with a circumflex then 
only matches at the beginning of the line will be 
considered. On the other hand, if the pattern ends 
with a dollar then only matches at the end of the 
line will be considered. 


Beginning 


For example, if we are looking for lines starting
with Malignant Hyperthermia, we can use the
following pattern:


$ grep -E '^(M|m)alignant (H|h)yperthermia' chebi_27732.txt


The output will include the list of lines beginning
with a mention of Malignant Hyperthermia:


Malignant hyperthermia (MH) is a 
potentially fatal autosomal 


Malignant hyperthermia (MH) is a 
pharmacogenetic disorder 


To check how many of the matching lines 
were filtered, we can count the number of oc- 
currences when using the circumflex and when 
not: 


$ grep -c -E '^(M|m)alignant (H|h)yperthermia' chebi_27732.txt
$ grep -c -E '(M|m)alignant (H|h)yperthermia' chebi_27732.txt


The output will show that only 23 of the 96 
matches were considered. 


Ending 


If we are looking for lines ending with a mention
of Malignant Hyperthermia, then we can add the
dollar character to the end of the pattern:


$ grep -E '(M|m)alignant (H|h)yperthermia.$' chebi_27732.txt


To allow a punctuation character before the end 
of the line, we added the dot character before the 
dollar character in the pattern. The dot character 
matches any character, including the dot itself. 

The output will be the list of lines ending with
a mention of Malignant Hyperthermia:


Novel mutation in the RYR1 gene
(R2454C) in a patient with
malignant hyperthermia.

Identification of a novel
mutation in the ryanodine
receptor gene (RYR1) in
patients with malignant
hyperthermia.

Novel skeletal muscle ryanodine
receptor mutation in a large
Brazilian family with
malignant hyperthermia.


We can check how many lines were filtered by 
using again the -c option: 


$ grep -c -E '(M|m)alignant (H|h)yperthermia.$' chebi_27732.txt
$ grep -c -E '(M|m)alignant (H|h)yperthermia' chebi_27732.txt


The output will show that only 15 of the 96 
matches were at the end of the line. 


Near the End 


Sometimes we do not want the mention ending
exactly at the last character. We may be more
flexible and allow a following expression, or
a given number of characters. For example, to
allow up to 10 other characters between the mention
of Malignant Hyperthermia and the end of the line,
we can add a quantifier to the dot operator:


$ grep -c -E '(M|m)alignant (H|h)yperthermia.{0,10}$' chebi_27732.txt


The output will show that we have 20 matches. 

If we remove the -c option, we will be able to
check that words, such as families and patients,
are now allowed to appear between the mention
of Malignant Hyperthermia and the end of the
line:


Novel mutations in C-terminal
channel region of the
ryanodine receptor in
malignant hyperthermia
patients.

Novel missense mutations and
unexpected multiple changes
of RYR1 gene in 75 malignant
hyperthermia families.
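
The effect of the position operators can be checked by counting matching lines on a small sample. The four lines below are invented for illustration; the pattern is stored in a variable and combined with the circumflex, the dollar sign, and the quantified dot:

```shell
sample=$(mktemp)
printf '%s\n' \
  'Malignant hyperthermia (MH) is a pharmacogenetic disorder' \
  'Novel mutation in a patient with malignant hyperthermia.' \
  'RYR1 mutations in malignant hyperthermia families.' \
  'caffeine and tension values' > "$sample"

pattern='(M|m)alignant (H|h)yperthermia'
anywhere=$(grep -c -E "${pattern}" "$sample")           # any position
at_start=$(grep -c -E "^${pattern}" "$sample")          # beginning of the line
at_end=$(grep -c -E "${pattern}.\$" "$sample")          # one character before the end
near_end=$(grep -c -E "${pattern}.{0,10}\$" "$sample")  # up to 10 characters before the end

echo "anywhere=$anywhere start=$at_start end=$at_end near_end=$near_end"
rm -f "$sample"
```

The third line ends with the ten characters " families.", so it is counted only when the quantified dot is used.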


Word in Between 


To allow a word in between, independently of 
its length, we can add to the pattern an optional 
sequence of non-space characters (the word) pre- 
ceded by a space: 


$ grep -c -E '(M|m)alignant (H|h)yperthermia( [^ ]*)?.$' chebi_27732.txt


The output will show that we have 24 matches.
We should note that the negated group [^ ] excludes
the space character, so the optional part cannot
span two words.

If we remove the -c option, we will be able 
to check that lengthy words (with more than 
10 characters), such as susceptibility, are now 
allowed to appear between the mention of Ma- 
lignant Hyperthermia and the end of the line: 


Ryanodine receptor gene point 
mutation and malignant 
hyperthermia susceptibility. 


Full Line 


If we want lines that start with a mention of
Malignant Hyperthermia and end with an acronym,
MH or MHS, then we can execute two grep
commands. The first gets the lines starting with
Malignant Hyperthermia, and the second keeps only
those lines of its output that end with an
acronym:


$ grep -E '^(M|m)alignant (H|h)yperthermia' chebi_27732.txt | \
  grep -w -E 'MHS?.$'


Alternatively, we can add both the circumflex
and dollar operators to the same pattern.
However, we cannot forget to add .* to match
anything in between them, since we are asking for
full line matches:


$ grep -w -E '^(M|m)alignant (H|h)yperthermia.*MHS?.$' chebi_27732.txt


We can see that both commands match all the 
text of the abstract since each abstract is stored in 
a single line of the file: 


Malignant hyperthermia (MH) is a 
pharmacogenetical 
complication as for 
genetic diagnosis of MH. 

Malignant hyperthermia 
susceptibility (MHS) is a 


subclinical pharmacogenetic 
disorder been tested 
positive for MHS. 


This demonstrates the problem of tokenization, 
since usually what we really need is to match a 
full sentence or a phrase. And in that case each 
line should represent a sentence or phrase from 
the abstract. 


Match Position 


For more advanced processing, we may be
interested in knowing the exact position of the
matches. This can be done by using the -b option
of grep, which, combined with -o, provides the
number of bytes in the input before the start of
each match:


$ echo 'MHS MHN MHE' | grep -b -o -w -E 'MH[SNE]'


The equivalent long form of the -b option is
--byte-offset.

The output shows the list of matches preceded 
by their position in the given line: 


0:MHS 
4:MHN 
8:MHE


Tokenization


As we have shown in the previous section, some- 
times we need to work at the level of a sentence 
and not use a full document as the input string. 
Tokenization is a Natural Language Processing 
(NLP) task that aims at identifying boundaries 
in the text to fragment it into basic units called 
tokens. These tokens can be sentences, phrases, 
multi-word expressions, or words. 


Character Delimiters 


In most languages, some specific characters can 
be considered as accurate boundaries to fragment 
text into tokens. For example, the space character 
to identify words; the period (.), the question 
mark (?) and the exclamation mark (!) to identify 
the ending of a sentence; and the comma (,), 
the semicolon (;), the colon (:) or any kind of 
parenthesis to identify a phrase within a sentence. 
However, this problem may be more complex in
languages without explicit delimiters, such as
Chinese (Wu and Fung 1994).

A common approach to tokenization is to use 
regular expressions to replace these delimiters by 
newline characters. This will result in a token per 
line. For example, we can replace the characters 
specifying the end of a sentence with a newline 
by using the tr command and then count the 
number of lines: 


$ tr '[.!?]' '\n' < chebi_27732.txt | wc -l


We get 1493 lines from the original 248
lines:


$ wc -l chebi_27732.txt


Unfortunately, it is not that simple. We
need to analyze the output:


$ tr '[.!?]' '\n' < chebi_27732.txt | less


Wrong Tokens


We can check that: (i) many lines are empty,
because the delimiter at the end of the last
sentence of each line is also replaced, adding an
extra newline character; and (ii) the dot character
is also used as a decimal mark in numbers, so
some sentences are split into multiple lines because
they contain a decimal number. For example,
the original sentence:


These 10 mutations account for 
21.9% of the North American 
MH-susceptible population 


is split in two lines: 


These 10 mutations account for 
21 

9% of the North American MH- 
susceptible population 


String Replacement 


This means that looking at just one character is
not enough; we need some context. To do this, we
will use the sed command, which we may consider
a more powerful version of the tr command. The
sed command is a stream editor that receives a
string as input and performs basic text
transformations, such as replacing one expression
with another, that are available in almost all text
editors. For example, we can use a simple sed
command to replace every mention of caffeine with
its ChEBI identifier:


$ sed -E 's/caffeine/CHEBI:27732/gi' chebi_27732.txt


The -E option allows us to use extended regular
expressions, as we did before with grep. The
s command has the following syntax
's/FIND/REPLACE/FLAGS', where: FIND is the pattern
to find in the input string; REPLACE is the
expression that replaces the matches; and FLAGS are
options, such as g to replace all matches in each
line and not just the first one, and i to be case
insensitive.




For example, the original fragment of text: 


link between the caffeine 
threshold and tension 


will be converted to: 


link between the CHEBI:27732 
threshold and tension 


Multi-character Delimiters 


To replace the delimiter characters by a newline 
when followed by at least one space character, we 
can use the following command: 


$ sed -E 's/[.!?] +/\n/g' chebi_27732.txt


We should note that by making the space character
compulsory, we avoid: (i) empty lines caused by
splitting a sentence that is already at the end of
the line (assuming there are no ghost space
characters at the end of each line), and (ii) decimal
markers, because they are followed by numerical
digits and not spaces.

We now get 1067 lines from the original 248 
lines: 


$ sed -E 's/[.!?] +/\n/g' chebi_27732.txt | wc -l
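
The difference between the two approaches can be seen on a single invented line containing a decimal number. This sketch assumes GNU sed, where \n in the replacement expression produces a newline; lines are counted with grep -c '' so that the result is a plain number:

```shell
text='These 10 mutations account for 21.9% of the population. Is that all? No!'

# tr replaces every delimiter, including the decimal mark and the
# delimiter at the very end of the line.
with_tr=$(printf '%s\n' "$text" | tr '.!?' '\n\n\n' | grep -c '')

# sed only replaces delimiters that are followed by at least one space.
with_sed=$(printf '%s\n' "$text" | sed -E 's/[.!?] +/\n/g' | grep -c '')

echo "tr: $with_tr lines, sed: $with_sed lines"
```

With tr, the number 21.9 is broken in two and an empty line is produced at the end, while sed returns the three sentences.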


Keep Delimiters 


The previous sed command is removing the 
delimiter characters from the text, and this 
may cause other problems. The best solution 
is to keep the delimiter characters and just 
add the newline. The sed command allows 
us to keep each match for a specific part of 
the pattern (sub-pattern) by enclosing it within 
parentheses. To include the match of a sub- 
pattern in the replace expression, we can use 
the backslash and its numerical order. Thus, we 
can improve our sed command by using this 
technique so we do not remove any delimiter 
character: 


$ sed -E 's/([.!?])( +)/\1\n\2/g' chebi_27732.txt




However, other common issues may still per- 
sist. For instance, there are some sentences start- 
ing right after the delimiter characters without 
any space in between: 


bulk.Fetal 
sequencing.Whole 


These sentences include a delimiter character 
directly followed by an alphabetic letter: 


$ sed -E 's/([.!?])( +)/\1\n\2/g' chebi_27732.txt | \
  grep -i '[.!?][a-z]'


To minimize this issue, we can change the pattern
so that the compulsory space character becomes
optional, while requiring a following uppercase
letter:


$ sed -E 's/([.!?])( *[A-Z])/\1\n\2/g' chebi_27732.txt | wc -l


We now get 1127 lines, i.e. this pattern is
more flexible and was able to split 60 more
sentences. This does not mean that it is free of
errors. It is almost impossible to derive a rule
that covers all the possible typos humans can
produce.
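
The two sub-pattern versions can be compared on one invented line that contains both a missing space after a period and a decimal number. As before, this sketch assumes GNU sed for the \n in the replacement expression:

```shell
text='muscle bulk.Fetal samples were sent for sequencing. Whole exomes covered 21.9% of cases.'

# Compulsory space after the delimiter: "bulk.Fetal" is not split.
strict=$(printf '%s\n' "$text" | sed -E 's/([.!?])( +)/\1\n\2/g')
# Optional space followed by an uppercase letter: "bulk.Fetal" is split,
# while the decimal number 21.9 is left untouched.
flexible=$(printf '%s\n' "$text" | sed -E 's/([.!?])( *[A-Z])/\1\n\2/g')

strict_lines=$(printf '%s\n' "$strict" | grep -c '')
flexible_lines=$(printf '%s\n' "$flexible" | grep -c '')
first_sentence=$(printf '%s\n' "$flexible" | head -n 1)

echo "strict: $strict_lines lines, flexible: $flexible_lines lines"
echo "$flexible"
```

Since the spaces belong to the second sub-pattern, they are kept at the beginning of the following line, so a sentence may start with a space.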

As an example, Fig. 4.1 shows a complex
pattern, adapted from Wikipedia. The pattern
is equivalent to \. {2,}[A-Z], and identifies
multiple spaces at the beginning of a sentence.
The pattern requires at least two spaces to be
matched, but only after a period and before an
uppercase letter.


Sentences File 


Using our previous pattern, we can update 
our script named gettext.sh to provide the text 
already split in sentences by adding the sed 
command: 


ID=$1 # The CHEBI identifier given as input is renamed to ID
grep -e '<title>' -e '<rdfs:comment>' chebi\_$ID\_*.rdf | \
gawk -F'[<>]' '{ print $3 }' | \
sed -E 's/([.!?])( *[A-Z])/\1\n\2/g'


Fig. 4.1 Identifying multiple spaces at the beginning of a
sentence using regular expressions (Adapted from:
https://en.wikipedia.org/wiki/Regular_expression)


To save the output as a file named chebi_27732_ 
sentences.txt, we only need to add the redirection 
operator: 


$ ./gettext.sh 27732 > chebi_27732_sentences.txt


Each line of the file chebi_27732_sentences.txt
represents a sentence.


Entity Recognition 


To select the sentences with one of our acronyms, 
we can use the grep command and our sen- 
tences file: 


$ grep -w -E 'MH[SNE]?' chebi_27732_sentences.txt


The output will only include matching sen- 
tences: 


Interestingly, the data suggest 
a link between the caffeine 
threshold and tension values 
and the MH/CCD phenotype. 


Alternatively, we can use the -n option to get 
the number of the line and the -o option to get 
the acronym matched: 




$ grep -n -o -w -E 'MH[SNE]?' chebi_27732_sentences.txt


The equivalent long form to the -n option is 
--line-number. The output should be some- 
thing like this: 


1106:MH
1106:MH
1108:MH
1110:MH
1111:MH


We can now make a script that receives a 
pattern as argument and the input text as the 
standard input, to display the line numbers and 
the matches in a TSV format. Thus, let us create a 
script file named getentities.sh with the following 
lines: 


PATTERN=$1
grep -n -o -w -E "$PATTERN" | \
tr ':' '\t'


Again we should not forget to save the file in our 
working directory, and add the right permissions 
with chmod, as we did with our scripts in the 
previous chapter. 

The first line stores the pattern given as
argument in the variable PATTERN. The grep
command finds the matches and the tr command
replaces each colon by a tab character to produce
TSV content.

We can now execute the script giving the 
pattern as argument and the sentences file as 
standard input: 


$ ./getentities.sh 'MH[SNE]?' < chebi_27732_sentences.txt


The output should be something like this:


1106	MH
1106	MH
1108	MH
1110	MH
1111	MH


We should note that now we have the values 
separated by a tab character, i.e. the output is in 
TSV format. 




The output can also be saved as a TSV file that we can open directly in our preferred spreadsheet application. For example, to save it as chebi_27732.tsv, we only need to add the redirection operator:

$ ./getentities.sh 'MH[SNE]?' < chebi_27732_sentences.txt > chebi_27732.tsv


Select the Sentence 


If we want to analyze a specific matched sentence, we can use a text editor and go to that line number. A more efficient alternative is to use the print p option of sed to output a given line number. For example, to check the MHS match at line 2:

$ sed -n '2p' chebi_27732_sentences.txt


Now we can easily check the context of the 
match: 


in susceptible people (MHS) by volatile
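
When many lines match, checking them one by one becomes tedious. A possible shortcut, shown here only as a sketch on hypothetical sample sentences, is to collect the distinct line numbers with cut and sort, and then let a loop call sed for each of them. The variable names and the sample file are assumptions made for this illustration.

```shell
# Hypothetical sample file: one sentence per line (not the real data).
sample=$(mktemp)
printf '%s\n' \
  'Malignant hyperthermia affects susceptible people (MHS).' \
  'No acronym is mentioned here.' \
  'Both MHN and MHE individuals were tested.' > "$sample"

# Collect the distinct line numbers that contain at least one match.
lines=$(grep -n -o -w -E 'MH[SNE]?' "$sample" | cut -d ':' -f 1 | sort -n -u)

# Print each of those sentences with the p command of sed.
selected=$(for n in $lines; do sed -n "${n}p" "$sample"; done)
printf '%s\n' "$selected"
rm -f "$sample"
```

Only the first and third sentences are printed, since the second one has no acronym.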


Pattern File 


The script created in the previous section only accepts one pattern; however, we may need to recognize different entities, or different mentions of the same entity, such as the official name, possible synonyms, and the acronyms. Fortunately, grep allows us to include a list of patterns directly from a file using the -f option. The equivalent long form to the -f option is --file=FILE. For example, we can create a text file named patterns.txt with the following three patterns:

(M|m)alignant (H|h)yperthermia
MH[SNE]?
(C|c)affeine


Then we can execute the previous grep but 
using multiple patterns specified in the pattern 
file: 


$ grep -n -o -w -E -f patterns.txt chebi_27732_sentences.txt


Analyzing the output, we can check that the 
same sentences may include different entities: 


1110:MH 
1110:caffeine 
1111:caffeine 
1111:MH 


We can now update our script named getentities.sh to receive as input not a single pattern but the filename where multiple patterns can be found.

PATTERNS=$1
grep -n -o -w -E -f "$PATTERNS" | \
tr ':' '\t'

We can execute the script giving as argument the file containing the patterns:

$ ./getentities.sh patterns.txt < chebi_27732_sentences.txt

To save the output as a file named chebi_27732.tsv, we only need to add the redirection operator:

$ ./getentities.sh patterns.txt < chebi_27732_sentences.txt > chebi_27732.tsv


Using the patterns.txt file is very useful if, for example, we are not focused on a single disease and we want to find any disease mentioned in the text. In these cases, we have to create a file with the full lexicon of diseases. This topic will be addressed in the following chapter.
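
To see the pattern file at work without the real data, the sketch below builds a tiny sentences file and a patterns file in temporary locations and runs the same grep and tr pipeline. Both files and their contents are hypothetical and only illustrate the idea.

```shell
# Hypothetical data, created in temporary files for illustration only.
sentences=$(mktemp)
patterns=$(mktemp)
printf '%s\n' \
  'Malignant hyperthermia is triggered by anesthetics.' \
  'The test uses caffeine in MHS patients.' > "$sentences"
printf '%s\n' \
  '(M|m)alignant (H|h)yperthermia' \
  'MH[SNE]?' \
  '(C|c)affeine' > "$patterns"

# -f reads one pattern per line; a match of any pattern is reported.
found=$(grep -n -o -w -E -f "$patterns" "$sentences" | tr ':' '\t')
printf '%s\n' "$found"
rm -f "$sentences" "$patterns"
```

The second sentence produces two output lines, one for each entity it mentions.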


Relation Extraction 


Finding the relevant entities in text is sometimes not enough. We need to know which sentences may describe possible relationships between those entities, such as a relation between a disease and a compound.

This is a complex text mining challenge, but a simple approach is to construct a pattern that allows any kind of characters between two entities:

$ grep -n -w -E 'MH[SNE]?.*(C|c)affeine' chebi_27732_sentences.txt


The following sentence is one of the seven displayed sentences mentioning a possible relation:

239: MHS families were investigated with a caffeine


However, we are missing all the sentences that 
have caffeine first: 


$ grep -n -w -E '(C|c)affeine.*MH[SNE]?' chebi_27732_sentences.txt


We will be able to see that sometimes caffeine 
comes first: 


801: caffeine-halothane 
contracture test were greater 
in those who had a known MH 


caffeine threshold and 
tension values and the MH 


Multiple Filters 


The most flexible approach is to use two grep commands. The first selects the sentences mentioning one of the entities, and the other selects from the previously selected sentences the ones mentioning the other entity. For example, we can first search for the acronyms and then for caffeine:

$ grep -n -w -E 'MH[SNE]?' chebi_27732_sentences.txt | grep -w -E '(C|c)affeine'


This will show all the nine sentences mentioning 
caffeine and an acronym. 
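
The difference between a single ordered pattern and two chained filters can be checked on a few made-up sentences. The sketch below is only an illustration; the sentences and variable names are hypothetical.

```shell
# Hypothetical sample sentences (not the real data).
sample=$(mktemp)
printf '%s\n' \
  'MHS families were investigated with a caffeine test.' \
  'The caffeine threshold was linked to the MH phenotype.' \
  'Caffeine occurs naturally in tea and coffee.' > "$sample"

# One pattern with a fixed order: only finds the acronym followed by caffeine.
ordered=$(grep -c -w -E 'MH[SNE]?.*(C|c)affeine' "$sample")

# Two chained filters: the order of the entities no longer matters.
chained=$(grep -n -w -E 'MH[SNE]?' "$sample" | grep -c -w -E '(C|c)affeine')

echo "ordered pattern: $ordered, chained filters: $chained"
rm -f "$sample"
```

The ordered pattern misses the second sentence, where caffeine comes first, while the chained filters select both sentences that mention the two entities.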

Relation Type


If we are interested in a specific type of relationship, we may have an additional filter for a specific verb. For example, we can add a filter for sentences with the verb response or diagnosed:

$ grep -n -w -E 'MH[SNE]?' chebi_27732_sentences.txt | grep -w -E '(C|c)affeine' | grep -w -E 'response|diagnosed'


We should note that this does not take into account where the verb appears in the sentence. For example, in the following sentence the verb response appears before either of the two entities:

50: The relationship between the IVCT response and genotype
was the number of MHS
discordants at 2.0 mM
caffeine


If the verb needs to appear between the two entities, we have to construct a pattern that has these words in the middle of them:

$ grep -n -w -E 'MH[SNE]?.*(response|diagnosed).*(C|c)affeine' chebi_27732_sentences.txt


We can see now that the previous sentence (line 
50) is not presented as a match. 
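
The effect of fixing the position of the verb can be reproduced on two made-up sentences. This is only a sketch; the sentences and variable names are hypothetical.

```shell
# Hypothetical sample sentences (not the real data).
sample=$(mktemp)
printf '%s\n' \
  'The response of MHS muscle to caffeine was measured.' \
  'MHS patients were diagnosed using caffeine.' > "$sample"

# Chained filters: the verb may appear anywhere in the sentence.
anywhere=$(grep -w -E 'MH[SNE]?' "$sample" | grep -w -E '(C|c)affeine' | grep -c -w -E 'response|diagnosed')

# Single pattern: the verb must appear between the acronym and caffeine.
between=$(grep -c -w -E 'MH[SNE]?.*(response|diagnosed).*(C|c)affeine' "$sample")

echo "anywhere: $anywhere, between: $between"
rm -f "$sample"
```

Both sentences pass the chained filters, but only the second one has the verb between the two entities.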


Remove Relation Types 


We may also be interested in ignoring specific types of relations. To do that, we only need to use the -v (or --invert-match) option. For example, to ignore sentences with the word response or diagnosed:

$ grep -n -w -E 'MH[SNE]?' chebi_27732_sentences.txt | grep -w -E '(C|c)affeine' | grep -v -w -E 'response|diagnosed'

None of the resulting sentences mentions response or diagnosed.
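
Since -v simply inverts the selection, the sentences kept by the filter and the sentences removed by it always add up to the full input. The sketch below checks this on three made-up sentences; the data and variable names are hypothetical.

```shell
# Hypothetical sample sentences (not the real data).
sample=$(mktemp)
printf '%s\n' \
  'MHS patients showed a response to caffeine.' \
  'MHS was diagnosed before caffeine was given.' \
  'MHN samples and caffeine were stored together.' > "$sample"

# Sentences that mention one of the verbs.
with_verb=$(grep -c -w -E 'response|diagnosed' "$sample")

# Sentences that do not mention any of them (inverted match).
without_verb=$(grep -c -v -w -E 'response|diagnosed' "$sample")
remaining=$(grep -v -w -E 'response|diagnosed' "$sample")

echo "with verb: $with_verb, without verb: $without_verb"
printf '%s\n' "$remaining"
rm -f "$sample"
```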


If we want to have a deeper knowledge about 
text processing tasks and challenges, we 
may be interested in reading some chapters 
of the book entitled Speech and language 
processing (Jurafsky and Martin 2014). 
The book is a highly specialized document 
explaining in full detail the topics here briefly 
described. 

To have an overview of the state of the art in text processing tools using biomedical literature, we should consider reading a recent and comprehensive survey (Lamurias and Couto 2019).

Abstract

In the previous chapter we were able to automatically process text by recognizing a limited set of entities. This chapter will introduce the world of semantics, and present step-by-step examples to retrieve and enhance text and data processing by using semantics. The goal is to equip the reader with the basic set of skills to explore semantic resources that are nowadays available using simple shell script commands.

Classes


In the previous chapters we searched for mentions of caffeine and malignant hyperthermia in text. However, we may miss related entities that may also be of interest to us. These related entities can be found in semantic resources, such as ontologies. The semantics of caffeine and malignant hyperthermia are represented in the ChEBI and DO ontologies, respectively.

OWL Files 


Thus, we can start by retrieving both ontologies, 
i.e. their OWL files. 


$ curl -O 'https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/master/src/ontology/releases/2018-11-02/doid.owl'
$ curl -O 'ftp://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel169/ontology/chebi_lite.owl'


The -O option saves the content to a local file named according to the name of the remote file, usually the last part of the URL. The equivalent long form to the -O option is --remote-name.

The previous commands will create the files doid.owl and chebi_lite.owl, respectively. We should note that these links are for the specific releases used in this book. Using another release may change the output of the examples presented in this chapter.

The links may also change in the future, so we may need to check them on the BioPortal¹ or on the OBO Foundry² webpages. Alternatively, we can also get the OWL files from the book file archive³.

¹ http://bioportal.bioontology.org/


Class Label 


Both OWL files use the XML format syntax. Thus, to check if our entities are represented in the ontology, we can search for ontology elements that contain them using a simple grep command:

$ grep '>malignant hyperthermia<' doid.owl
$ grep '>caffeine<' chebi_lite.owl


For each grep the output will be the line that 
describes the property label (rdfs:label), which is 
inside the definition of the class that represents 
the entity: 


<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperthermia</rdfs:label>

<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">caffeine</rdfs:label>


Class Definition 


To retrieve the full class definition, a more efficient approach is to use the xmllint command, which we already used in previous chapters:

$ xmllint --xpath "//*[local-name()='label' and text()='malignant hyperthermia']/.." doid.owl


The XPath query starts by finding the label that 
contains malignant hyperthermia and then .. 
gives the parent element, in this case the Class 
element. 


² http://www.obofoundry.org/
³ http://abs.rd.ciencias.ulisboa.pt/book/


5 Semantic Processing 


From the output we can see that the semantics 
of malignant hyperthermia is much more than its 
label: 


<owl:Class rdf:about="http://purl.obolibrary.org/obo/DOID_8545">
  <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_0050736"/>
  <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/DOID_66"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/IDO_0000664"/>
      <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/GENO_0000147"/>
    </owl:Restriction>
  </rdfs:subClassOf>
  <obo:IAO_0000115
  ...
  <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">UMLS_CUI:C0024591</oboInOwl:hasDbXref>
  <oboInOwl:hasExactSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">anesthesia related hyperthermia</oboInOwl:hasExactSynonym>
  <oboInOwl:hasExactSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperpyrexia due to anesthesia</oboInOwl:hasExactSynonym>
  <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">disease_ontology</oboInOwl:hasOBONamespace>
  <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">DOID:8545</oboInOwl:id>
  <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/doid#DO_MGI_slim"/>
  <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/doid#DO_rare_slim"/>
  <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/doid#NCIthesaurus"/>
  <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Xref MGI. OMIM mapping confirmed by DO. [SN].</rdfs:comment>
  <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperthermia</rdfs:label>
</owl:Class>


A graphical visualization of this class is depicted in Fig. 5.1.

Fig. 5.1 Class description of malignant hyperthermia in the Human Disease Ontology (Source: http://www.ontobee.org/)

For example, we can check that malignant hyperthermia is a subclass (specialization) of the entries 0050736 and 66. We can directly use the link⁴ in our browser to know more about this parent disease. We will see that it represents a muscle tissue disease. This means that malignant hyperthermia is a special case of a muscle tissue disease.

We can do the same to retrieve the full class 
definition of caffeine: 


$ xmllint --xpath "//*[local-name()='label' and text()='caffeine']/.." chebi_lite.owl

From the output we can see that the types of semantics available for caffeine differ from the semantics of malignant hyperthermia, but they still share many important properties, such as the definition of subClassOf:


<owl:Class rdf:about="http://purl.obolibrary.org/obo/CHEBI_27732">
  <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/CHEBI_26385"/>
  <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/CHEBI_27134"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
      <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_25435"/>
    </owl:Restriction>
  </rdfs:subClassOf>

  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/RO_0000087"/>
      <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/CHEBI_85234"/>
    </owl:Restriction>
  </rdfs:subClassOf>
  <obo:IAO_0000115 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">A trimethylxanthine in which the three methyl groups are located at positions 1, 3, and 7. A purine alkaloid that occurs naturally in tea and coffee.</obo:IAO_0000115>
  <oboInOwl:hasAlternativeId rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHEBI:22982</oboInOwl:hasAlternativeId>
  <oboInOwl:hasAlternativeId rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHEBI:3295</oboInOwl:hasAlternativeId>
  <oboInOwl:hasAlternativeId rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHEBI:41472</oboInOwl:hasAlternativeId>
  <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">chebi_ontology</oboInOwl:hasOBONamespace>
  <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">CHEBI:27732</oboInOwl:id>
  <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/chebi#3_STAR"/>
  <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">caffeine</rdfs:label>
</owl:Class>

⁴ http://purl.obolibrary.org/obo/DOID_66


A graphical visualization of this class is depicted in Fig. 5.2.

The class caffeine is a specialization of two other entries: 26385 (purine alkaloid⁵), and 27134 (trimethylxanthine⁶). However, it contains additional subclass relationships that do not represent subsumption (is-a).


Related Classes


Figures 5.3 and 5.4 show other related classes of malignant hyperthermia and caffeine, respectively.

For example, the relationship between caffeine and the entry 25435 (mutagen⁷) is defined by the entry 0000087 (has role⁸) of the Relations Ontology. This means that the relationship defines that caffeine has role mutagen.

We can also search in the OWL file for the 
definition of the type of relation has role: 


$ xmllint --xpath "//*[local-name()='ObjectProperty'][@*[local-name()='about']='http://purl.obolibrary.org/obo/RO_0000087']" chebi_lite.owl

⁵ http://purl.obolibrary.org/obo/CHEBI_26385
⁶ http://purl.obolibrary.org/obo/CHEBI_27134
⁷ http://purl.obolibrary.org/obo/CHEBI_25435
⁸ http://purl.obolibrary.org/obo/RO_0000087

The XPath query starts by finding the elements ObjectProperty and then selects the ones containing the about attribute with the relation URI as value.

We can check that the relation is neither transitive nor cyclic:


<owl:ObjectProperty rdf:about="http://purl.obolibrary.org/obo/RO_0000087">
  <oboInOwl:hasDbXref rdf:datatype="http://www.w3.org/2001/XMLSchema#string">RO:0000087</oboInOwl:hasDbXref>
  <oboInOwl:hasOBONamespace rdf:datatype="http://www.w3.org/2001/XMLSchema#string">chebi_ontology</oboInOwl:hasOBONamespace>
  <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">has_role</oboInOwl:id>
  <oboInOwl:is_cyclic rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</oboInOwl:is_cyclic>
  <oboInOwl:is_transitive rdf:datatype="http://www.w3.org/2001/XMLSchema#boolean">false</oboInOwl:is_transitive>
  <oboInOwl:shorthand rdf:datatype="http://www.w3.org/2001/XMLSchema#string">has_role</oboInOwl:shorthand>
  <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">has role</rdfs:label>
</owl:ObjectProperty>

Fig. 5.2 Class description of caffeine in ChEBI (Source: http://www.ontobee.org/)


Superclasses & Asserted Axioms 
* muscle tissue disease 
* autosomal dominant disease 
* has material basis in some autosomal dominant inheritance 


Fig. 5.3 Related classes of malignant hyperthermia in the Human Disease Ontology (Source: http://www.ontobee.org/)


A graphical visualization of this property is 
depicted in Fig. 5.5. 


URIs and Labels 


In the previous examples, we searched the OWL file using labels and URIs. To standardize the process, we will create two scripts that will convert a label into a URI and vice versa. The idea is to perform all the internal ontology processing using the URIs and in the end convert them to labels, so we can use them in text processing.


URI of a Label 


To get the URI of malignant hyperthermia, we 
can use the following query: 


$ xmllint --xpath "//*[local-name()='label' and text()='malignant hyperthermia']/../@*[local-name()='about']" doid.owl

We added the @*[local-name()='about'] to extract the URI specified as an attribute of that class.

The output will be the name of the attribute 
and its value: 


rdf:about="http://purl.obolibrary.org/obo/DOID_8545"

Superclasses & Asserted Axioms
* has role some ryanodine receptor agonist
* has role some adenosine receptor antagonist
* has role some ryanodine receptor modulator
* has role some adenosine A2A receptor antagonist

Fig. 5.4 Related classes of caffeine in ChEBI (Source: http://www.ontobee.org/)

Fig. 5.5 Description of has role property (Source: http://www.ontobee.org/)


To extract only the value, we can add the 
string function to the XPath query: 


$ xmllint --xpath "string(//*[local-name()='label' and text()='malignant hyperthermia']/../@*[local-name()='about'])" doid.owl

Unfortunately, the string function returns only one attribute value, even if many are matched. Nonetheless, we use the string function because we assume that malignant hyperthermia is an unambiguous label, i.e. only one class will match.

The output will now be only the attribute 
value: 


http://purl.obolibrary.org/obo/DOID_8545

Getting the URI of caffeine requires almost the same command:

$ xmllint --xpath "string(//*[local-name()='label' and text()='caffeine']/../@*[local-name()='about'])" chebi_lite.owl


We can now write a script that receives multiple labels given as standard input and the OWL file where to find the URIs as argument. Thus, we can create the script named geturi.sh with the following lines:

OWLFILE=$1
xargs -I {} xmllint --xpath "//*[local-name()='label' and text()='{}']/../@*[local-name()='about']" "$OWLFILE" | \
tr '"' '\n' | grep 'http'

Again we cannot forget to save the file in our working directory, and add the right permissions using chmod as we did with our scripts in the previous chapters. The xargs command is used to process each line of the standard input. The tr command was added because xmllint displays all the matches in the same line, so we split the output using the character delimiting the URI, i.e. ". Then we use the grep command to keep only the lines with a URI, i.e. the ones that contain the term http.
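
The splitting step can be tried in isolation, without xmllint or the OWL files. In the sketch below the variable raw holds a hypothetical string that imitates two attribute matches printed on a single line; the exact layout produced by xmllint may differ between versions, so this is an assumption made only to illustrate the tr and grep commands.

```shell
# Hypothetical xmllint-like output: two attribute matches on one line.
raw=' rdf:about="http://purl.obolibrary.org/obo/DOID_8545" rdf:about="http://purl.obolibrary.org/obo/DOID_66"'

# Split on the double quotes, then keep only the pieces that hold a URI.
uris=$(printf '%s\n' "$raw" | tr '"' '\n' | grep 'http')
printf '%s\n' "$uris"
```

The attribute names end up on lines of their own and are discarded by grep, so only the two URIs remain, one per line.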

Now to execute the script we only need to 
provide the labels as standard input: 


$ echo 'malignant hyperthermia' | ./geturi.sh doid.owl
$ echo 'caffeine' | ./geturi.sh chebi_lite.owl

The output should be the URIs of those classes:

http://purl.obolibrary.org/obo/DOID_8545
http://purl.obolibrary.org/obo/CHEBI_27732

We can also execute the script using multiple labels, one per line:

$ echo -e 'malignant hyperthermia\nmuscle tissue disease' | ./geturi.sh doid.owl
$ echo -e 'caffeine\npurine alkaloid\ntrimethylxanthine' | ./geturi.sh chebi_lite.owl

The output will be a URI for each label:

http://purl.obolibrary.org/obo/DOID_8545
http://purl.obolibrary.org/obo/DOID_66
http://purl.obolibrary.org/obo/CHEBI_27732
http://purl.obolibrary.org/obo/CHEBI_26385
http://purl.obolibrary.org/obo/CHEBI_27134


Label of a URI 


To get the label of the disease entry with the 
identifier 8545, we can also use the xmllint 
command: 




$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/DOID_8545']/*[local-name()='label']/text()" doid.owl

We added the *[local-name()='label'] to select the element within the class that describes the label.

The output should be the label we were expecting:

malignant hyperthermia


We can do the same to get the label of the 
compound entry with the identifier 27732: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/CHEBI_27732']/*[local-name()='label']/text()" chebi_lite.owl

Again, the output should be the label we were 
expecting: 

caffeine 

We can now write a script that receives multiple URIs given as standard input and the OWL file where to find the labels. We can create a script named getlabels.sh with the following lines:

OWLFILE=$1
xargs -I {} xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='{}']/*[local-name()='label']" "$OWLFILE" | \
tr '<>' '\n' | \
grep -v -e ':label' -e '^$'

The xargs command is used to process each line of the standard input. The text function does not add a newline character after each match, so if we have multiple matches it is almost impossible to separate them. This explains why we removed the text function from the XPath. Then we have to split the result in multiple lines using the tr command and filter out the lines that contain the :label keyword or are empty.
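
The same kind of isolated check can be done for this splitting step. The variable raw below is a hypothetical string that imitates two label elements returned on a single line; it is an assumption made for illustration and not actual xmllint output.

```shell
# Hypothetical xmllint-like output: two label elements on one line.
raw='<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperthermia</rdfs:label><rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">muscle tissue disease</rdfs:label>'

# Break the line at every < and >, then drop the tag lines and empty lines.
labels=$(printf '%s\n' "$raw" | tr '<>' '\n' | grep -v -e ':label' -e '^$')
printf '%s\n' "$labels"
```

After the split, the opening and closing tags sit on lines that contain :label, so the grep command leaves only the text between them.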




Now to execute the script we only need to 
provide the URIs as standard input: 


$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl
$ echo 'http://purl.obolibrary.org/obo/CHEBI_27732' | ./getlabels.sh chebi_lite.owl


The output should be the labels of those 
classes: 


malignant hyperthermia 
caffeine 


We can also execute the script with multiple 
URIs: 


$ echo -e 'http://purl.obolibrary.org/obo/DOID_8545\nhttp://purl.obolibrary.org/obo/DOID_66' | ./getlabels.sh doid.owl
$ echo -e 'http://purl.obolibrary.org/obo/CHEBI_27732\nhttp://purl.obolibrary.org/obo/CHEBI_26385\nhttp://purl.obolibrary.org/obo/CHEBI_27134' | ./getlabels.sh chebi_lite.owl


The output will be a label for each URI: 


malignant hyperthermia 
muscle tissue disease 


caffeine 
purine alkaloid 
trimethylxanthine 


To test both scripts, we can feed the output of 
one as the input of the other, for example: 


$ echo -e 'malignant hyperthermia\nmuscle tissue disease' | ./geturi.sh doid.owl | ./getlabels.sh doid.owl
$ echo -e 'caffeine\npurine alkaloid\ntrimethylxanthine' | ./geturi.sh chebi_lite.owl | ./getlabels.sh chebi_lite.owl


The output will be the original input, i.e. the 
labels given as arguments to the echo command: 


malignant hyperthermia 
muscle tissue disease 


caffeine 
purine alkaloid 
trimethylxanthine 


Now we can use the URIs as input: 


$ echo -e 'http://purl.obolibrary.org/obo/DOID_8545\nhttp://purl.obolibrary.org/obo/DOID_66' | ./getlabels.sh doid.owl | ./geturi.sh doid.owl
$ echo -e 'http://purl.obolibrary.org/obo/CHEBI_27732\nhttp://purl.obolibrary.org/obo/CHEBI_26385\nhttp://purl.obolibrary.org/obo/CHEBI_27134' | ./getlabels.sh chebi_lite.owl | ./geturi.sh chebi_lite.owl

Again the output will be the original input, i.e. the URIs given as arguments to the echo command:

http://purl.obolibrary.org/obo/DOID_8545
http://purl.obolibrary.org/obo/DOID_66
http://purl.obolibrary.org/obo/CHEBI_27732
http://purl.obolibrary.org/obo/CHEBI_26385
http://purl.obolibrary.org/obo/CHEBI_27134

Synonyms


Concepts are not always mentioned using the same official label. Frequently, alternative labels appear in text. This is why some of the classes also specify alternative labels, such as the ones represented by the element hasExactSynonym.

For example, to find all the synonyms of a 
disease, we can use the same XPath as used 
before but replacing the keyword label by 
hasExactSynonym: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/DOID_8545']/*[local-name()='hasExactSynonym']" doid.owl

The output will be the two synonyms of malignant hyperthermia:

<oboInOwl:hasExactSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">anesthesia related hyperthermia</oboInOwl:hasExactSynonym>
<oboInOwl:hasExactSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperpyrexia due to anesthesia</oboInOwl:hasExactSynonym>


We can also get both the primary label and 
the synonyms. We only need to add an alternative 
match to the keyword label: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/DOID_8545']/*[local-name()='hasExactSynonym' or local-name()='label']" doid.owl

The output will now include the two synonyms plus the official label:




<oboInOwl:hasExactSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">anesthesia related hyperthermia</oboInOwl:hasExactSynonym>
<oboInOwl:hasExactSynonym rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperpyrexia due to anesthesia</oboInOwl:hasExactSynonym>
<rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">malignant hyperthermia</rdfs:label>

Thus, we can now update the script getlabels.sh to include synonyms:

OWLFILE=$1
xargs -I {} xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='{}']/*[local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym' or local-name()='label']" "$OWLFILE" | \
tr '<>' '\n' | \
grep -v -e ':label' -e ':hasExactSynonym' -e 'hasRelatedSynonym' -e '^$'

We should note that the XPath query and the grep command were modified by adding the hasExactSynonym keyword. We also added the hasRelatedSynonym keyword, which is available for some classes.

We can test the script exactly in the same way 
as before: 


$ echo -e 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl

But now the output will display multiple labels for this class:

anesthesia related hyperthermia
malignant hyperpyrexia due to anesthesia
malignant hyperthermia


URI of Synonyms 


Since the script now returns alternative labels, 
we may encounter some problems if we send the 
output to the geturi.sh script: 


$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl | ./geturi.sh doid.owl

The previous command will display XPath warnings for the two synonyms:

XPath set is empty
XPath set is empty
http://purl.obolibrary.org/obo/DOID_8545

If we do not want to know about these mismatches, we can always redirect them to the null device:

$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl | ./geturi.sh doid.owl 2>/dev/null


However, we can update the script geturi.sh to 
also include synonyms: 


OWLFILE=$1
xargs -I {} xmllint --xpath "//*[(local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym' or local-name()='label') and text()='{}']/../@*[local-name()='about']" $OWLFILE | \
tr '"' '\n' | grep 'http'

Now we can execute the same command:


$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl | ./geturi.sh doid.owl

Every label should now be matched with exactly the same class:


http://purl.obolibrary.org/obo/DOID_8545
http://purl.obolibrary.org/obo/DOID_8545
http://purl.obolibrary.org/obo/DOID_8545


If we want to avoid repetitions, we can add the 
sort command with the -u option to the end of 
each command, as we did in previous chapters: 


$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getlabels.sh doid.owl | ./geturi.sh doid.owl | sort -u


The output should now be only one URI: 


http://purl.obolibrary.org/obo/DOID_8545


Parent Classes 


Parent classes represent generalizations that may 
also be relevant to recognize in text. To extract all 
the parent classes of malignant hyperthermia, we 
can use the following XPath query: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/DOID_8545']/*[local-name()='subClassOf']/@*[local-name()='resource']" doid.owl

The first part of the XPath is the same as above to get the class element, then [local-name()='subClassOf'] is used to get the subclass element, and finally @*[local-name()='resource'] is used to get the attribute containing its URI.

The output should be the URIs representing 
the parents of class 8545: 


 rdf:resource="http://purl.obolibrary.org/obo/DOID_0050736"
 rdf:resource="http://purl.obolibrary.org/obo/DOID_66"


We can also execute the same command for 
caffeine: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/CHEBI_27732']/*[local-name()='subClassOf']/@*[local-name()='resource']" chebi_lite.owl


The output will now include two parents: 


 rdf:resource="http://purl.obolibrary.org/obo/CHEBI_26385"
 rdf:resource="http://purl.obolibrary.org/obo/CHEBI_27134"


We should note that we can no longer use the string function, because ontologies are organized as DAGs using multiple inheritance, i.e. each class can have multiple parents, and the string function only returns the first match. To get only the URIs, we can apply the previous technique of using the tr and grep commands:


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/CHEBI_27732']/*[local-name()='subClassOf']/@*[local-name()='resource']" chebi_lite.owl | tr '"' '\n' | grep 'http'


Now the output only contains the URIs: 


http://purl.obolibrary.org/obo/CHEBI_26385
http://purl.obolibrary.org/obo/CHEBI_27134
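
The tr and grep step can be tried without the OWL file. The following is a minimal sketch that applies it to the attribute output shown above; the variable names attrs and uris are assumptions made only for this illustration.

```shell
# Attribute output as printed by xmllint for the parents of caffeine (copied from above).
attrs=' rdf:resource="http://purl.obolibrary.org/obo/CHEBI_26385" rdf:resource="http://purl.obolibrary.org/obo/CHEBI_27134"'

# Break the line at every double quote, then keep only the pieces that contain a URI.
uris=$(printf '%s\n' "$attrs" | tr '"' '\n' | grep 'http')
printf '%s\n' "$uris"
```

Splitting on the double quote works here because each URI is the only quoted value of its attribute.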


We can now create a script that receives multiple URIs given as standard input and the OWL file where to find all the parents as argument.

The script named getparents.sh should contain the following lines:

OWLFILE=$1
xargs -I {} xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='{}']/*[local-name()='subClassOf']/@*[local-name()='resource']" $OWLFILE | \
tr '"' '\n' | grep 'http'

To get the parents of malignant hyperthermia, we will only need to give the URI as input and the OWL file as argument:


$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getparents.sh doid.owl


The output will include the URIs of the two 
parents: 


http://purl.obolibrary.org/obo/DOID_0050736
http://purl.obolibrary.org/obo/DOID_66


Labels of Parents 


But if we need the labels we can redirect the 
output to the getlabels.sh script: 


$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getparents.sh doid.owl | ./getlabels.sh doid.owl

The output will now be the labels of the parents of malignant hyperthermia:


autosomal dominant disease 
muscle tissue disease 


Again, the same can be done with caffeine: 


$ echo 'http://purl.obolibrary.org/obo/CHEBI_27732' | ./getparents.sh chebi_lite.owl | ./getlabels.sh chebi_lite.owl

And now the output contains the labels of the parents of caffeine:


purine alkaloid 
trimethylxanthine 


Related Classes 


If we are interested in using all the related classes 
besides the ones that represent a generalization 
(subClassOf), we have to change our XPath to: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/CHEBI_27732']/*[local-name()='subClassOf']//*[local-name()='someValuesFrom']/@*[local-name()='resource']" chebi_lite.owl | tr '"' '\n' | grep 'http'

We should note that these related classes are in the attribute resource of the someValuesFrom element inside a subClassOf element.

The URIs of the 18 related classes of caffeine 
are now displayed: 


http://purl.obolibrary.org/obo/CHEBI_25435
http://purl.obolibrary.org/obo/CHEBI_35337
http://purl.obolibrary.org/obo/CHEBI_35471
http://purl.obolibrary.org/obo/CHEBI_35498
http://purl.obolibrary.org/obo/CHEBI_35703
http://purl.obolibrary.org/obo/CHEBI_38809
http://purl.obolibrary.org/obo/CHEBI_50218
http://purl.obolibrary.org/obo/CHEBI_50925
http://purl.obolibrary.org/obo/CHEBI_53121
http://purl.obolibrary.org/obo/CHEBI_60809
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/
http://purl.obolibrary.org/obo/


Labels of Related Classes 


To get the labels of these related classes, we only 
need to add the getlabels.sh script: 


$ xmllint --xpath "//*[local-name()='Class'][@*[local-name()='about']='http://purl.obolibrary.org/obo/CHEBI_27732']/*[local-name()='subClassOf']//*[local-name()='someValuesFrom']/@*[local-name()='resource']" chebi_lite.owl | tr '"' '\n' | grep 'http' | ./getlabels.sh chebi_lite.owl


The output is now 18 terms that we could use 
to expand our text processing: 


mutagen
central nervous system stimulant
psychotropic drug
diuretic
xenobiotic
ryanodine receptor modulator
EC 3.1.4.* (phosphoric diester hydrolase) inhibitor
EC 2.7.11.1 (non-specific serine/threonine protein kinase) inhibitor
adenosine A2A receptor antagonist
adjuvant
food additive
ryanodine receptor agonist
adenosine receptor antagonist
mouse metabolite
plant metabolite
fungal metabolite
environmental contaminant
human blood serum metabolite


Ancestors 


Finding all the ancestors of a class requires chaining many invocations of getparents.sh until we get no matches. We should also avoid relations that are cyclic, otherwise we will enter an infinite loop. Thus, for identifying the ancestors of a class, we will only consider parent relations, i.e. subsumption relations.


Grandparents 


In the previous section we were able to extract the direct parents of a class, but the parents of these parents also represent generalizations of the original class. For example, to get the parents of the parents (grandparents) of malignant hyperthermia we need to invoke getparents.sh twice:

$ echo 'malignant hyperthermia' | ./geturi.sh doid.owl | ./getparents.sh doid.owl | ./getparents.sh doid.owl


And we will find the URIs of the grandparents 
of malignant hyperthermia: 


http://purl.obolibrary.org/obo/DOID_0050739
http://purl.obolibrary.org/obo/DOID_0080000

Or to get their labels we can add the getlabels.sh script:

$ echo 'malignant hyperthermia' | ./geturi.sh doid.owl | ./getparents.sh doid.owl | ./getparents.sh doid.owl | ./getlabels.sh doid.owl


And we find the labels of the grandparents of 
malignant hyperthermia: 


autosomal genetic disease 
muscular disease 


Root Class 


However, there are classes that do not have any parent, which are called root classes. In Figs. 5.1 and 5.2, we can see that disease and chemical entity are the root classes of the DO and ChEBI ontologies, respectively. As we can see, these are highly generic terms.

To check whether a class is a root class, we can ask for its parents:

$ echo 'disease' | ./geturi.sh doid.owl | ./getparents.sh doid.owl
$ echo 'chemical entity' | ./geturi.sh chebi_lite.owl | ./getparents.sh chebi_lite.owl

In both cases, we will get the warning that no matches were found, confirming that they are root classes.


XPath set is empty 


Recursion 


We can now build a script that receives a list of URIs as standard input, and invokes getparents.sh recursively until it reaches the root class.

The script named getancestors.sh should contain the following lines:

1 OWLFILE=$1
2 CLASSES=$(cat -)
3 [[ -z "$CLASSES" ]] && exit
4 PARENTS=$(echo "$CLASSES" | ./getparents.sh $OWLFILE | sort -u)
5 echo "$PARENTS"
6 echo "$PARENTS" | ./getancestors.sh $OWLFILE


The second line of the script saves the standard input in a variable named CLASSES, because we need to use it twice: (i) to check if the input has any classes or is empty (third line) and (ii) to get the parents of the classes given as input (fourth line). If the input is empty then the script ends; this is the base case of the recursion⁹. This is required so that the recursion stops at a given point. Otherwise, the script would run indefinitely until the user stops it manually.

The fourth line of the script stores the output in a variable named PARENTS, because we also need to use it twice: (i) to output these direct parents (fifth line), and (ii) to get the ancestors of these parents (sixth line). We should note that we are invoking the getancestors.sh script inside getancestors.sh itself, which defines the recursion step. Since the subsumption relation is acyclic, we expect that at some point we will reach classes without parents (root classes) and then the script will end.

We should note that the variables CLASSES and PARENTS in the echo commands need to be inside double quotes, so that the newline characters are preserved.
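
The base case and the recursion step can be tried without an ontology file. The following is a minimal sketch, assuming a toy table of child and parent pairs with hypothetical classes a to d, and using shell functions (get_parents and get_ancestors, names chosen for this illustration) in place of the scripts getparents.sh and getancestors.sh.

```shell
# Toy "child parent" table standing in for the subClassOf relations
# (hypothetical classes; the real scripts query the OWL file with xmllint).
edges=$(mktemp)
printf '%s\n' 'd b' 'd c' 'b a' 'c a' > "$edges"

# Print the direct parents of every class read from standard input.
get_parents() {
  while read -r class; do
    awk -v c="$class" '$1 == c { print $2 }' "$edges"
  done
}

# Print the parents, then recurse on them until no parents are left.
get_ancestors() {
  classes=$(cat -)
  [ -z "$classes" ] && return                 # base case: empty input
  parents=$(printf '%s\n' "$classes" | get_parents | sort -u)
  [ -z "$parents" ] && return                 # root reached: nothing to print
  printf '%s\n' "$parents"
  printf '%s\n' "$parents" | get_ancestors    # recursion step
}

ancestors=$(echo 'd' | get_ancestors | sort -u)
printf '%s\n' "$ancestors"
rm -f "$edges"
```

Unlike the script above, this sketch also stops as soon as no parents are found, which avoids printing an empty line when the root is reached.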


Iteration 


Recursion is often computationally expensive, but usually it is possible to replace recursion with iteration to develop a more efficient algorithm. Explaining iteration and how to refactor a recursive script is out of the scope of this book; nevertheless, the following script represents an equivalent way to get all the ancestors without using recursion:

1 # iteration
2 OWLFILE=$1
3 CLASSES=$(cat -)
4 ANCESTORS=""
5 while [[ ! -z "$CLASSES" ]]
6 do
7 PARENTS=$(echo "$CLASSES" | ./getparents.sh $OWLFILE | sort -u)
8 ANCESTORS="$ANCESTORS\n$PARENTS"
9 CLASSES=$PARENTS
10 done
11 echo -e "$ANCESTORS"

⁹ https://en.wikipedia.org/wiki/Recursion


The script uses the while command, which implements iteration by repeating a set of commands (lines 7-9) while a given condition is satisfied (line 5).
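
The earlier warning about cyclic relations also applies to the iterative version. The following is a minimal sketch of one possible guard, assuming a toy table of child and parent pairs with hypothetical classes a to d, where the pair 'a d' closes a cycle on purpose. It is an illustration of the idea and not part of the scripts of this chapter: each round keeps only the parents that were not seen before.

```shell
# Toy "child parent" table (hypothetical classes); 'a d' makes the relation cyclic.
edges=$(mktemp)
printf '%s\n' 'd b' 'd c' 'b a' 'c a' 'a d' > "$edges"
seen=$(mktemp)     # ancestors found so far, one per line

classes='d'
while [ -n "$classes" ]; do
  # Direct parents of all the classes of this round.
  parents=$(printf '%s\n' "$classes" | while read -r c; do
      awk -v c="$c" '$1 == c { print $2 }' "$edges"
    done | sort -u)
  # Keep only the parents not seen before, so a cycle cannot loop forever.
  new=$(printf '%s\n' "$parents" | grep -v -x -F -f "$seen")
  [ -n "$new" ] && printf '%s\n' "$new" >> "$seen"
  classes=$new
done

ancestors=$(sort -u "$seen")
printf '%s\n' "$ancestors"
rm -f "$edges" "$seen"
```

Because of the cycle, the starting class d ends up in its own list of ancestors, but the loop still terminates after four rounds.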

To test the recursive script, we can provide as standard input the URI of malignant hyperthermia:

$ echo 'http://purl.obolibrary.org/obo/DOID_8545' | ./getancestors.sh doid.owl


The output will be the URIs of all its ancestors:

http://purl.obolibrary.org/obo/DOID_0050736
http://purl.obolibrary.org/obo/DOID_66
http://purl.obolibrary.org/obo/DOID_0050739
http://purl.obolibrary.org/obo/DOID_0080000
http://purl.obolibrary.org/obo/DOID_0050177
http://purl.obolibrary.org/obo/DOID_17
http://purl.obolibrary.org/obo/DOID_630
http://purl.obolibrary.org/obo/DOID_7
http://purl.obolibrary.org/obo/DOID_4


We should note that we will still receive the 
XPath warning when the script reaches the root 
class and no parents are found: 


XPath set is empty 


To remove this warning and just get the labels 
of the ancestors of malignant hyperthermia, we 
can redirect the warnings to the null device: 


$ echo 'malignant hyperthermia' | ./geturi.sh doid.owl | ./getancestors.sh doid.owl 2>/dev/null | ./getlabels.sh doid.owl


The output will now include the names of all ancestors of malignant hyperthermia:

autosomal dominant disease
muscle tissue disease
autosomal genetic disease
muscular disease
monogenic disease
musculoskeletal system disease
genetic disease
disease of anatomical entity
disease

We should note that the first two ancestors are the direct parents of malignant hyperthermia, and the last one is the root class. This happens because the recursive script prints the parents before invoking itself to find the ancestors of the direct parents.

We can do the same with caffeine, but be 
advised that given the higher number of ancestors 
in ChEBI we may now have to wait a little longer 
for the script to end. 


$ echo 'caffeine' | ./geturi.sh chebi_lite.owl | ./getancestors.sh chebi_lite.owl | ./getlabels.sh chebi_lite.owl | sort -u

The results include repeated classes that were found by using different branches, which is why we need to add the sort command with the -u option to eliminate the duplicates.

The command will print the ancestors that were found:


alkaloid
aromatic compound
bicyclic compound
carbon group molecular entity
chemical entity
cyclic compound
heteroarene
heterobicyclic compound
heterocyclic compound
heteroorganic entity
heteropolycyclic compound
imidazopyrimidine
main group molecular entity
methylxanthine
molecular entity
molecule
nitrogen molecular entity
organic aromatic compound
organic cyclic compound
organic heterobicyclic compound
organic heterocyclic compound
organic heteropolycyclic compound
organic molecular entity
organic molecule
organonitrogen compound
organonitrogen heterocyclic compound
p-block molecular entity
pnictogen molecular entity
polyatomic entity
polycyclic compound
purine alkaloid
purines
trimethylxanthine


My Lexicon 


Now that we know how to extract all the labels 
and related classes from an ontology, we can 
construct our own lexicon with the list of terms 
that we want to recognize in text. 

Let us start by creating the file do_8545_lexicon.txt representing our lexicon for malignant hyperthermia with all its labels:

$ echo 'malignant hyperthermia' | ./geturi.sh doid.owl | ./getlabels.sh doid.owl > do_8545_lexicon.txt


Ancestors Labels 


Now we can add to the lexicon all the labels of the 
ancestors of malignant hyperthermia by adding 
the redirection operator: 


$ echo 'malignant hyperthermia' | ./geturi.sh doid.owl | ./getancestors.sh doid.owl | ./getlabels.sh doid.owl >> do_8545_lexicon.txt

We should note that now we use >> and not >; this appends more lines to the file instead of creating a new file from scratch.
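
The difference between the two redirection operators can be checked on a temporary file. The following is a minimal sketch; the file is created with mktemp and the variable names lexicon, total and unique are assumptions made only for this illustration.

```shell
lexicon=$(mktemp)
echo 'malignant hyperthermia' > "$lexicon"    # > creates the file or replaces its contents
echo 'muscle tissue disease' >> "$lexicon"    # >> appends a new line
echo 'muscle tissue disease' >> "$lexicon"    # appended a second time on purpose

total=$(wc -l < "$lexicon" | tr -d ' ')               # number of lines in the file
unique=$(sort -u "$lexicon" | wc -l | tr -d ' ')      # number of distinct lines
echo "lines: $total, unique: $unique"
rm -f "$lexicon"
```

Appending the same label twice leaves a duplicate in the file, which is why the sort command with the -u option is useful when reading a lexicon built in several steps.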

Now we can check the contents of the file do_8545_lexicon.txt to see the terms we got:

$ cat do_8545_lexicon.txt | sort -u


We should note that we use the sort command 
with the -u option to eliminate any duplicates 
that may exist. 

We should be able to see the following labels: 


anesthesia related hyperthermia
autosomal dominant disease
autosomal genetic disease
disease
disease of anatomical entity
genetic disease
malignant hyperpyrexia due to anesthesia
malignant hyperthermia
monogenic disease
muscle tissue disease
muscular disease
musculoskeletal system disease


We can also apply the same commands for caffeine to produce its lexicon in the file chebi_27732_lexicon.txt by adding the redirection operator:

$ echo 'caffeine' | ./geturi.sh chebi_lite.owl | ./getlabels.sh chebi_lite.owl > chebi_27732_lexicon.txt
$ echo 'caffeine' | ./geturi.sh chebi_lite.owl | ./getancestors.sh chebi_lite.owl | ./getlabels.sh chebi_lite.owl >> chebi_27732_lexicon.txt

We should note that it may take a while until it gets all labels.

Now let us check the contents of this new lexicon:

$ cat chebi_27732_lexicon.txt | sort -u


Now we should be able to see that this lexicon is much larger:

alkaloid
aromatic compound
bicyclic compound
caffeine


Merging Labels 


If we are interested in finding everything related to caffeine or malignant hyperthermia, we may be interested in merging the two lexicons in a file named lexicon.txt:

$ cat do_8545_lexicon.txt chebi_27732_lexicon.txt | sort -u > lexicon.txt

Using this new lexicon, we can recognize any mention in our previous file named chebi_27732_sentences.txt:

$ grep -w -i -F -f lexicon.txt chebi_27732_sentences.txt

We added the -F option because our lexicon is a list of fixed strings, i.e. it does not include regular expressions. The equivalent long form of the -F option is --fixed-strings.
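
To see the effect of the -F option on entries that contain characters with a special meaning in regular expressions, the following is a minimal sketch on two temporary files. The two lexicon entries appear in the disease lexicon built later in this chapter, while the sentences and the variable names are hypothetical and serve only this illustration.

```shell
lexicon=$(mktemp)
sentences=$(mktemp)
# One entry has parentheses; with -F they are treated as ordinary characters.
printf '%s\n' 'hepatitis C' 'Post measles encephalitis (disorder)' > "$lexicon"
# Hypothetical sentences standing in for the sentences file.
printf '%s\n' 'The patient was tested for hepatitis C.' 'No disease is mentioned here.' > "$sentences"

# -n prints the line number, -o only the matched text, -w requires whole words.
found=$(grep -n -o -w -F -f "$lexicon" "$sentences")
printf '%s\n' "$found"
rm -f "$lexicon" "$sentences"
```

Only the first sentence contains a lexicon entry, so a single match is reported together with its line number.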

We now get more sentences, including some 
that do not include a direct mention to caffeine 
or malignant hyperthermia. For example, the 
following sentence was selected because it 
mentions molecule, which is an ancestor of 
caffeine: 


The remainder of the molecule is 
hydrophilic and presumably 
constitutes the cytoplasmic 
domain of the protein. 


Another example is the following sentence, which was selected because it mentions disease, which is an ancestor of malignant hyperthermia:

Our data suggest that divergent activity profiles may cause varied disease phenotypes by specific mutations.

We can also use our script getentities.sh giving this lexicon as argument. However, since we are not using any regular expressions it would be better to add the -F option to the grep command in the script, so the lexicon is interpreted as a list of fixed strings to be matched. Only then can we execute the script safely:

$ ./getentities.sh lexicon.txt < chebi_27732_sentences.txt


Ancestors Matched 


Besides these two previous examples, we can check if there are other ancestors being matched by using the grep command with the -o option:

$ grep -o -w -F -f lexicon.txt chebi_27732_sentences.txt | sort -u


We can see that besides the terms caffeine 
and malignant hyperthermia, only one ancestor 
of each one of them was matched, molecule and 
disease, respectively: 


caffeine 
disease 
malignant hyperthermia 
molecule 


This can be explained because our text is somewhat limited and because we are using the official labels, so we may be missing acronyms and simple variations such as the plural of a term. To cope with this issue, we may use a stemmer¹⁰, or use all the ancestors besides subsumption. However, if our lexicon is small, it is better to do it manually and maybe add some regular expressions to deal with some of the variations.

¹⁰ https://en.wikipedia.org/wiki/Stemming

Generic Lexicon

Instead of using a customized and limited lexicon, we may be interested in recognizing any of the diseases represented in the ontology. By recognizing all the diseases in our caffeine related text, we will be able to find all the diseases that may be related to caffeine.


All Labels 


To extract all the labels from the disease ontology 
we can use the same XPath query used before, but 
now without restricting it to any URI: 


$ xmllint --xpath "//*[local-name()='Class']/*[local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym' or local-name()='label']" doid.owl

We can create a script named getalllabels.sh that receives as argument the OWL file where to find all the labels, and that contains the following lines:

OWLFILE=$1
xmllint --xpath "//*[local-name()='Class']/*[local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym' or local-name()='label']" $OWLFILE | \
tr '<>' '\n' | \
grep -v -e ':label' -e ':hasExactSynonym' -e 'hasRelatedSynonym' -e '^$' | \
sort -u


We should note that this script is similar to the 
getlabels.sh script without the xargs, since it 
does not receive a list of URIs as standard input. 

Now we can execute the script to extract all 
labels from the OWL file: 


$ ./getalllabels.sh doid.owl

The output will contain the full list of diseases:

11-beta-hydroxysteroid dehydrogenase deficiency type 2
11p partial monosomy syndrome
1,4-phenylenediamine allergic contact dermatitis
...
Zoophilia
Zoophobia
zygomycosis

To create the generic lexicon, we can redirect the output to the file diseases.txt:

$ ./getalllabels.sh doid.owl > diseases.txt

We can check how many labels we got by using the wc command:

$ wc -l diseases.txt

The lexicon contains more than 29 thousand labels.

We can now recognize the lexicon entries in the sentences of the file chebi_27732_sentences.txt by using the grep command:

$ grep -n -w -E -f diseases.txt chebi_27732_sentences.txt

However, we will get the following error:


grep: Unmatched ) or \) 


This error happens because our lexicon contains 
some special characters also used by regular 
expressions, such as the parentheses. 

One way to address this issue is to replace 
the -E option by the -F option, that treats each 
lexicon entry as a fixed string to be recognized: 


$ grep -n -o -w -F -f diseases.txt chebi_27732_sentences.txt

The output will show the large list of sentences mentioning diseases:

1:malignant hyperthermia
2:malignant hyperthermia
9:central core disease
10:disease
10:myopathy
...
1092:malignant hyperthermia
1092:central core disease
1103:malignant hyperthermia
1104:malignant hyperthermia
1106:central core disease
1106:myopathy


Problematic Entries 


Despite using the -F option, the lexicon contains some problematic entries. Some entries have expressions enclosed by parentheses or brackets, which represent alternatives or a category:

Post measles encephalitis (disorder)
Glaucomatous atrophy [cupping] of optic disc

Other entries have separation characters, such as commas or colons, to represent a specialization. For example:

Tapeworm infection: intestinal taenia solum
Tapeworm infection: pork
Pemphigus, Benign Familial
ATR, nondeletion type

A problem is that not all separation characters have the same meaning. A comma may also be part of the term. For example:

46,XY DSD due to LHB deficiency

Another case is the use of &amp; to represent an ampersand. For example:

Gonococcal synovitis &amp;/or tenosynovitis

However, most of the time the alternatives are already included in the lexicon in different lines. For example:

Gonococcal synovitis and tenosynovitis
Gonococcal synovitis or tenosynovitis

As we can see from these examples, it is not trivial to devise rules that fully solve these issues. Very likely there will be exceptions to any rule we devise that we are not aware of.

Special Characters Frequency

To check the impact of each of these issues, we can count the number of times they appear in the lexicon:


$ grep -c -F '(' diseases.txt
$ grep -c -F ',' diseases.txt
$ grep -c -F '[' diseases.txt
$ grep -c -F ':' diseases.txt
$ grep -c -F '&amp;' diseases.txt


We will be able to see that parentheses and 
commas are the most frequent, with more than 
one thousand entries. 
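
The five counts can also be produced with a single loop. The following is a minimal sketch, assuming a small temporary lexicon built from entries quoted in this section instead of the full diseases.txt; the variable names lexicon, ch and counts are chosen only for this illustration.

```shell
lexicon=$(mktemp)
printf '%s\n' \
  'Post measles encephalitis (disorder)' \
  'ATR, nondeletion type' \
  '46,XY DSD due to LHB deficiency' \
  'Tapeworm infection: pork' \
  'Gonococcal synovitis &amp;/or tenosynovitis' > "$lexicon"

# For each special string, print it followed by the number of entries containing it.
counts=$(for ch in '(' ',' '[' ':' '&amp;'; do
  printf '%s %s\n' "$ch" "$(grep -c -F -- "$ch" "$lexicon")"
done)
printf '%s\n' "$counts"
rm -f "$lexicon"
```

The -c option counts matching lines, not occurrences, so an entry with two commas is still counted once.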


Completeness 


Now let us check if the ATR acronym representing the alpha thalassemia-X-linked intellectual disability syndrome is in the lexicon:

$ grep -E '^ATR' diseases.txt

All the entries include more terms than only the acronym:

ATR-16 syndrome
ATR, nondeletion type
ATR syndrome, deletion type
ATR syndrome linked to chromosome 16
ATR-X syndrome

Thus, a single ATR mention will not be recognized.

This is problematic if we need to match sentences mentioning that acronym, such as:

$ echo 'The ATR syndrome is an alpha thalassemia that has material basis in mutation in the ATRX gene on Xq21' | grep -w 'ATR'

We will now try to mitigate these issues as simply as we can. We will not try to solve them completely, but at least address the most obvious cases.

Removing Special Characters


The first fix is to remove all the parentheses and brackets by using the tr command, since they will not be found in the text:

$ tr -d '[](){}' < diseases.txt

Of course, we may lose the shorter labels, such as Post measles encephalitis, but at least now the disease Post measles encephalitis disorder will be recognized:

$ tr -d '[](){}' < diseases.txt | grep 'Post measles encephalitis disorder'

If we really need these alternatives, we would have to create multiple entries in the lexicon or transform the labels into regular expressions.


Removing Extra Terms


The second fix is to remove all the text after a separation character, by using the sed command:

$ tr -d '[](){}' < diseases.txt | sed -E 's/[,:;] .*$//'

We should note that the regular expression enforces a space after the separation character to avoid separation characters that are not really separating two expressions, such as: 46,XY DSD due to LHB deficiency.

We can see that now we are able to recognize both ATR and ATR syndrome:

$ tr -d '[](){}' < diseases.txt | sed -E 's/[,:;] .*$//' | grep -E '^ATR'


Removing Extra Spaces


The third fix is to remove any leading or trailing spaces of a label:

$ tr -d '[](){}' < diseases.txt | sed -E 's/[,:;] .*$//; s/^ *//; s/ *$//'

We should note that we added two more replacement expressions to the sed command by separating them with a semicolon.
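
The combined effect of the three fixes can be checked on a handful of entries. The following is a minimal sketch that feeds five labels quoted in this section to the same tr and sed commands; the variable name cleaned and the use of LC_ALL=C to make the sort order predictable are choices made only for this illustration.

```shell
# Five entries quoted in this section, cleaned with the three fixes in sequence.
cleaned=$(printf '%s\n' \
  'Post measles encephalitis (disorder)' \
  'ATR, nondeletion type' \
  'ATR syndrome, deletion type' \
  '46,XY DSD due to LHB deficiency' \
  'Tapeworm infection: pork' |
  tr -d '[](){}' |
  sed -E 's/[,:;] .*$//; s/^ *//; s/ *$//' |
  LC_ALL=C sort -u)
printf '%s\n' "$cleaned"
```

The comma in 46,XY is kept because it is not followed by a space, while the text after the comma or colon in the other entries is removed, leaving ATR, ATR syndrome and Tapeworm infection.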


We can now update the script getalllabels.sh to include the previous tr and sed commands:

OWLFILE=$1
xmllint --xpath "//*[local-name()='Class']/*[local-name()='hasExactSynonym' or local-name()='hasRelatedSynonym' or local-name()='label']" $OWLFILE | \
tr '<>' '\n' | \
grep -v -e ':label' -e ':hasExactSynonym' -e 'hasRelatedSynonym' -e '^$' | \
tr -d '[](){}' | \
sed -E 's/[,:;] .*$//; s/^ *//; s/ *$//' | sort -u

And we can now generate a fixed lexicon:

$ ./getalllabels.sh doid.owl > diseases.txt

We can check again the number of entries:

$ wc -l diseases.txt


We now have a lexicon with about 28 thousand labels. We have fewer entries because our fixes made some entries equal to others already in the lexicon, and thus the -u option filtered them.


Disease Recognition 


We can now try to recognize lexicon entries in the sentences of the file chebi_27732_sentences.txt:

$ grep -n -o -w -F -f diseases.txt chebi_27732_sentences.txt

To obtain the list of labels that were recognized, we can use the grep command:

$ grep -o -w -F -f diseases.txt chebi_27732_sentences.txt | sort -u

We will get a list of 43 unique labels representing diseases that may be related to caffeine:

Andersen-Tawil syndrome
arrhythmogenic right ventricular cardiomyopathy
ARVD2
ataxia
ATR
atrial fibrillation
benign congenital myopathy
cancer
cardiac arrest
cardiomyopathy
catecholaminergic polymorphic ventricular tachycardia
central core disease
chorea
congenital hip dislocation
congenital myopathy
deficiency
disease
dystonia
epilepsy
FHL1
hand
hepatitis C
HL
hypercholesterolaemia
hypokalemic periodic paralysis
Hypokalemic periodic paralysis
intellectual disability
long QT syndrome
LQT1
LQT2
LQT3
LQT5
LQT6
malignant hyperthermia
migraine
myopathy
myotonic dystrophy type 1
nemaline myopathy
nemaline rod myopathy
ophthalmoplegia
rod myopathy
scoliosis
syndrome
telangiectasia

Performance


The grep command is quite efficient, but when using large lexicons and texts we may start to notice some performance issues. Its execution time is proportional to the size of the lexicon, since each term of the lexicon corresponds to an independent pattern to match. This means that for large lexicons we may face serious performance issues.


Inverted Recognition 


A solution for dealing with large lexicons is to use the inverted recognition technique (Couto et al. 2017; Couto and Lamurias 2018). The inverted recognition uses the words of the input text as patterns to be matched against the lexicon file. When the number of words in the input text is much smaller than the number of terms in the lexicon, grep has far fewer patterns to match. For example, the inverted recognition technique applied to ChEBI has been shown to be more than 100 times faster than using the standard technique.
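
The idea of swapping the roles of text and lexicon can be illustrated with a few lines. The following is a much simplified sketch and not the published implementation: it assumes a toy lexicon and a hypothetical sentence, and it only recognizes lexicon entries made of a single word, since each pattern is one word of the text.

```shell
lexicon=$(mktemp); sentences=$(mktemp); words=$(mktemp)
printf '%s\n' 'central core disease' 'epilepsy' 'migraine' 'myopathy' > "$lexicon"
# Hypothetical sentence standing in for the input text.
printf '%s\n' 'Caffeine was discussed in relation to migraine and epilepsy.' > "$sentences"

# One word per line: every run of non-alphanumeric characters becomes a newline.
tr -cs '[:alnum:]' '\n' < "$sentences" | sort -u > "$words"

# Inverted step: the words of the text are the patterns and the lexicon is the
# searched file; -x keeps only the entries that are exactly equal to a word.
recognized=$(grep -x -F -f "$words" "$lexicon")
printf '%s\n' "$recognized"
rm -f "$lexicon" "$sentences" "$words"
```

Here grep receives nine patterns, one per distinct word of the sentence, regardless of how many entries the lexicon has. Entries with several words, such as central core disease, would need word pairs and triples as additional patterns.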


Case Insensitive 


Another performance issue arises when we use the -i option to perform a case insensitive matching. For instance, in most computers if we execute the following command, we will have to wait much longer than when not using the -i option:

$ grep -n -o -w -F -i -f diseases.txt chebi_27732_sentences.txt


One solution is to convert both the lexicon 
and text to lowercase (or uppercase), but this 
may result in more incorrect matches, such as 
incorrectly matching acronyms in lowercase. 
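
The lowercase conversion and its side effect can be seen on a tiny example. The following is a minimal sketch, assuming a lexicon with the two entries Dyskinesia and CAN, which are discussed later in this section, and a hypothetical sentence; the variable names are chosen only for this illustration.

```shell
lexicon=$(mktemp)
# Convert the lexicon to lowercase with tr.
printf '%s\n' 'Dyskinesia' 'CAN' | tr '[:upper:]' '[:lower:]' > "$lexicon"

# Hypothetical sentence; it is converted to lowercase before matching.
text='Caffeine can be discussed together with dyskinesia.'
matches=$(printf '%s\n' "$text" | tr '[:upper:]' '[:lower:]' | grep -o -w -F -f "$lexicon")
printf '%s\n' "$matches"
rm -f "$lexicon"
```

The disease dyskinesia is now found without the -i option, but the common word can is also reported, because the acronym CAN became indistinguishable from it.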


ASCII Encoding 


The low performance of case insensitive matching is normally due to the usage of the UTF-8 character encoding¹¹ instead of the ASCII character encoding¹². UTF-8 allows us to use special characters, such as the euro symbol, in a standard way, so that they are interpreted by every computer around the world in the same way. However, for normal text without special characters, ASCII works fine and more efficiently. In Unix shells we can normally specify the usage of ASCII encoding by adding the expression LC_ALL=C before the command (man locale for more information).

¹¹ https://en.wikipedia.org/wiki/UTF-8

So, another solution is to execute the following command:

$ LC_ALL=C grep -n -o -w -F -i -f diseases.txt chebi_27732_sentences.txt


We will be able to watch the significant increase 
in performance. 

To check how many labels are now being 
recognized we can execute: 


$ LC_ALL=C grep -o -w -F -i -f diseases.txt chebi_27732_sentences.txt | sort -u | wc -l


We have now 60 labels being recognized. 

To check which new labels were recognized, 
we can compare the results with and without the 
-i option: 


$ LC ALL-C grep -o -w -F -i -f 
diseases.txt 
chebi_27732_sentences.txt 
| sort -u > 
diseases_recognized_ignorecase 
.txt 

s grep -o -w -F -f diseases.txt 
chebi 27732 sentences.txt 
| sort -u » 
diseases recognized.txt 

$ grep -v -F -f 
diseases recognized.txt 
diseases recognized 
ignorecase.txt 


We are now able to see that the new labels are:

Arrhythmogenic right ventricular dysplasia
arthrogryposis
can
Catecholaminergic polymorphic ventricular tachycardia
Central Core Disease
defect
Disease
dyskinesia
face
fever
Malignant hyperthermia
Malignant Hyperthermia
March
ORF
total

¹² https://en.wikipedia.org/wiki/ASCII
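The grep -v -F -f filter removes every line that contains an already recognized label as a substring. As a hedged alternative, the comm command compares two sorted files line by line, so only exact duplicates are discarded. The sketch below uses made-up file names and a handful of labels rather than the real result files.

```shell
tmp=$(mktemp -d)
# Both inputs must be sorted with the same collation, hence LC_ALL=C
printf 'ATS\nmyopathy\n' | LC_ALL=C sort -u > "$tmp/recognized.txt"
printf 'ATS\ncan\nfever\nmyopathy\n' | LC_ALL=C sort -u > "$tmp/recognized_ignorecase.txt"

# -1 hides lines found only in the first file, -3 hides lines present in both
new_labels=$(LC_ALL=C comm -13 "$tmp/recognized.txt" "$tmp/recognized_ignorecase.txt")
echo "$new_labels"
```

Only the labels that appear exclusively in the second file remain, here can and fever.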


Correct Matches 


Some important diseases could only be recognized
by performing a case insensitive match,
such as arthrogryposis. This disease was missed
because the lexicon contains the capitalized and
uppercase versions of the label, but not the lowercase
version. We can check it by using the grep
command:

$ grep -i '^arthrogryposis$' diseases.txt

The output does not include the lowercase
version:

Arthrogryposis
ARTHROGRYPOSIS


We can also check in the text which versions 
are used: 


$ grep -w -i 'arthrogryposis' chebi_27732_sentences.txt

We can see that only the lowercase version is
used:

(multiple arthrogryposis, congenital dislocation of the hips
fetal akinesia, arthrogryposis multiplex


Another example is dyskinesia: 


$ grep -i '^dyskinesia$' diseases.txt


The lexicon has only the disease name with 
the first character in uppercase: 


Dyskinesia 


Incorrect Matches 


However, using a case insensitive match may also
create other problems, such as the acronym CAN
for the disease Crouzon syndrome-acanthosis
nigricans syndrome:

$ grep -i '^CAN$' diseases.txt

By using a case insensitive grep we will
recognize the common word can as a disease.
For example, we can check how many times CAN
is recognized:

$ LC_ALL=C grep -n -o -w -i -F -f diseases.txt chebi_27732_sentences.txt | grep -i ':CAN' | wc -l

It is recognized 18 times.
And to see which type of matches they are, we
can execute the following command:

$ LC_ALL=C grep -o -w -i -F -f diseases.txt chebi_27732_sentences.txt | grep -i -E '^CAN$' | sort -u

We can verify that the matches are incorrect 
mentions of the disease acronym: 


can 


This means we created at least 18 mismatches by 
performing a case insensitive match. 
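One possible way to limit such mismatches, sketched below under our own assumptions, is to match acronyms in a case sensitive way and the remaining labels in a case insensitive way. Here an acronym is assumed to be any label made only of uppercase letters and digits, and all file names and contents are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'CAN\nATS\nDyskinesia\n' > "$tmp/lexicon.txt"
printf 'dyskinesia can occur in ATS patients\n' > "$tmp/text.txt"

# Split the lexicon: acronyms (assumed: only uppercase letters and digits) and other labels
LC_ALL=C grep -E '^[A-Z0-9]+$' "$tmp/lexicon.txt" > "$tmp/acronyms.txt"
LC_ALL=C grep -v -E '^[A-Z0-9]+$' "$tmp/lexicon.txt" > "$tmp/names.txt"

# Acronyms are matched without -i, the other labels with -i
matches=$(
  {
    LC_ALL=C grep -o -w -F -f "$tmp/acronyms.txt" "$tmp/text.txt"
    LC_ALL=C grep -o -w -F -i -f "$tmp/names.txt" "$tmp/text.txt"
  } | LC_ALL=C sort -u
)
echo "$matches"
```

In this toy text the word can is no longer reported, while ATS and dyskinesia still are. The heuristic is crude, since some acronyms in a real lexicon may mix upper and lowercase letters.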


Entity Linking 


When we are using a generic lexicon, we may
be interested in identifying what the recognized
labels represent. For example, we may not be
aware of what the matched label AD2 represents.
To solve this issue, we can use our script
geturi.sh to perform linking (aka entity
disambiguation, entity mapping, normalization), i.e.
find the classes in the disease ontology that may
be represented by the recognized label. For example,
to find what AD2 represents, we can execute
the following command:

$ echo "AD2" | ./geturi.sh doid.owl | ./getlabels.sh doid.owl

In this case, the result clearly shows that AD2
represents Alzheimer disease:

AD2
Alzheimer disease 2, late onset
Alzheimer disease associated with APOE4
Alzheimer disease-2
Alzheimer's disease 2


Modified Labels 


However, we may not be so lucky with the labels 
that were modified by our previous fixes in the 
lexicon. For example, we can test the case of 
ATR: 


$ echo "ATR" | ./geturi.sh doid.owl

As expected, we received the warning that no
URI was found:

XPath set is empty

An approach to address this issue may involve
keeping track of the original labels of the lexicon
in another file.
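A minimal sketch of this idea, assuming a hypothetical two-column file label_map.tsv that stores each modified label followed by its original label. The labels below are placeholders, not entries of the ontology.

```shell
tmp=$(mktemp -d)
# Column 1: label as stored in the modified lexicon; column 2: original label
printf 'FOO\tFoo syndrome, type 1\nBAR\tBar disease\n' > "$tmp/label_map.tsv"

# Read the map first (NR == FNR), then translate each label from standard input
original=$(echo "FOO" |
  awk -F '\t' 'NR == FNR { map[$1] = $2; next } $0 in map { print map[$0] }' "$tmp/label_map.tsv" -)
echo "$original"
```

The recovered original label could then be given to geturi.sh instead of the modified one.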


Ambiguity 


We may also have to deal with ambiguity problems
where a label may represent multiple terms.
For example, we can check how many classes the
acronym ATS may represent:

$ echo "ATS" | ./geturi.sh doid.owl

We can see that it may represent two classes:

http://purl.obolibrary.org/obo/DOID_0050434
http://purl.obolibrary.org/obo/DOID_0110034


These two classes represent two distinct diseases, 
namely Andersen-Tawil syndrome and X-linked 
Alport syndrome, respectively. 
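To spot such ambiguous labels in bulk, one option is to keep a table of label and URI pairs and list the labels that occur more than once. The sketch below assumes a hypothetical tab-separated file links.tsv, filled here by hand with URIs that appear in this chapter.

```shell
tmp=$(mktemp -d)
{
  printf 'ATS\thttp://purl.obolibrary.org/obo/DOID_0050434\n'
  printf 'ATS\thttp://purl.obolibrary.org/obo/DOID_0110034\n'
  printf 'LQT1\thttp://purl.obolibrary.org/obo/DOID_0110644\n'
} > "$tmp/links.tsv"

# uniq -d prints only the labels that appear on more than one line
ambiguous=$(cut -f1 "$tmp/links.tsv" | LC_ALL=C sort | uniq -d)
echo "$ambiguous"
```

Only ATS is listed, since it is the single label linked to two different URIs.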

We can also obtain their alternative labels by
providing the two URIs as standard input to the
getlabels.sh script:

$ echo "http://purl.obolibrary.org/obo/DOID_0050434" | ./getlabels.sh doid.owl

$ echo "http://purl.obolibrary.org/obo/DOID_0110034" | ./getlabels.sh doid.owl

We will get the following two lists, both
containing ATS as expected:

ANDERSEN CARDIODYSRHYTHMIC PERIODIC PARALYSIS
ATS
Andersen syndrome
LQT7
Long QT syndrome 7
Potassium-Sensitive Cardiodysrhythmic Type
Andersen-Tawil syndrome

ATS
nephropathy and deafness, X-linked
X-linked Alport syndrome

If we find an ATS mention in the text, the
challenge is to identify which of the syndromes
the mention refers to. To address this challenge,
we may have to use advanced entity linking
techniques that analyze the context of the
text.


Surrounding Entities 


An intuitive solution is to select the class closest in
terms of meaning to the other classes mentioned
in the surrounding text. This assumes that entities
present in a piece of text are somehow semantically
related to each other, which is normally
the case. At least the author assumed some type
of relation between them, otherwise the entities
would not be in the same sentence.

Let us consider the following sentence about
genes and related syndromes from our text file
chebi_27732_sentences.txt (on line 436):

channel genes, KCNQ1 (LQT1), KCNH2 (LQT2), SCN5A (LQT3), KCNE1 (LQT5), and KCNE2 (LQT6), along with KCNJ2 (Andersen-Tawil syndrome) and

Now assume that the label Andersen-Tawil syndrome
has been replaced by the acronym ATS:

channel genes, KCNQ1 (LQT1), KCNH2 (LQT2), SCN5A (LQT3), KCNE1 (LQT5), and KCNE2 (LQT6), along with KCNJ2 (ATS) and

Then, to identify the diseases in the previous
sentence, we can execute the following command:

$ echo 'channel genes, KCNQ1 (LQT1), KCNH2 (LQT2), SCN5A (LQT3), KCNE1 (LQT5), and KCNE2 (LQT6), along with KCNJ2 (ATS) and' | grep -o -w -F -f diseases.txt

We have a list of labels that can help us decide
which is the right class representing ATS:

LQT1
LQT2
LQT3
LQT5
LQT6
ATS


To find their URIs we can use the geturi.sh
script:

$ echo 'channel genes, KCNQ1 (LQT1), KCNH2 (LQT2), SCN5A (LQT3), KCNE1 (LQT5), and KCNE2 (LQT6), along with KCNJ2 (ATS) and' | grep -o -w -F -f diseases.txt | ./geturi.sh doid.owl

The only ambiguity is for ATS, which returns
two URIs, one representing the Andersen-Tawil
syndrome (DOID:0050434) and the other
representing the X-linked Alport syndrome
(DOID:0110034):

http://purl.obolibrary.org/obo/DOID_0110644
http://purl.obolibrary.org/obo/DOID_0110645
http://purl.obolibrary.org/obo/DOID_0110646
http://purl.obolibrary.org/obo/DOID_0110647
http://purl.obolibrary.org/obo/DOID_0110648
http://purl.obolibrary.org/obo/DOID_0050434
http://purl.obolibrary.org/obo/DOID_0110034

To decide which of the two URIs we should
select, we can measure how close in meaning
they are to the other diseases also found in the
text.


Semantic Similarity 


Semantic similarity measures have been
successfully applied to solve these ambiguity
problems (Grego and Couto 2013). Semantic
similarity quantifies how close two classes
are in terms of semantics encoded in a given
ontology (Couto and Lamurias 2019). Using the
web tool Semantic Similarity Measures using
Disjunctive Shared Information (DiShIn)¹³,
we can calculate the semantic similarity
between our recognized classes. For example,
we can calculate the similarity between
LQT1 (DOID:0110644) and Andersen-Tawil
syndrome (DOID:0050434) (see Fig. 5.6), and
the similarity between LQT1 and X-linked Alport
syndrome (DOID:0110034) (see Fig. 5.7).


Measures 


DiShIn provides the similarity values for three
measures, namely Resnik, Lin and Jiang-Conrath
(Resnik 1995; Lin et al. 1998; Jiang and Conrath
1997). The last two measures provide values
between 0 and 1, and Jiang-Conrath is a distance
measure that is converted to similarity.

¹³ http://labs.rd.ciencias.ulisboa.pt/dishin/

DiShIn web form: Ontology DO - Human Disease Ontology; Entry 1 DOID:0110644; Entry 2 DOID:0050434

Measure  MICA/DiShIn  (Ex|In)trinsic  Similarity
Resnik   DiShIn       intrinsic       3.1715006566
Resnik   MICA         intrinsic       6.34300131319
Lin      DiShIn       intrinsic       0.376553538118
Lin      MICA         intrinsic       0.753107076235
JC       DiShIn       intrinsic       0.0952210062728
JC       MICA         intrinsic       0.240449173481

Fig. 5.6 Semantic similarity between LQT1 (DOID:0110644) and Andersen-Tawil syndrome (DOID:0050434) using the online tool DiShIn

We can see that for all measures LQT1 is much
more similar to Andersen-Tawil syndrome than
to X-linked Alport syndrome. Moreover, Jiang-Conrath's
measure gives the only similarity value
larger than zero for X-linked Alport syndrome,
since it is a converted distance measure. We obtain
similar results if we replace LQT1 by LQT2,
LQT3, LQT5, or LQT6. This means that by using
semantic similarity we can identify Andersen-Tawil
syndrome as the correct linked entity for
the mention ATS in this text.
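This decision can be expressed as selecting the candidate with the highest similarity. A minimal sketch, using the Resnik DiShIn values from Figs. 5.6 and 5.7 typed by hand into a two-column list (candidate identifier and similarity to LQT1):

```shell
# Candidate class for ATS <TAB> Resnik DiShIn similarity to LQT1 (values copied from the figures)
candidates=$(printf 'DOID_0050434\t3.1715006566\nDOID_0110034\t0.0\n')

# Keep the identifier with the largest value in the second column
best=$(printf '%s\n' "$candidates" |
  awk -F '\t' 'NR == 1 || $2 + 0 > max { max = $2 + 0; best = $1 } END { print best }')
echo "$best"
```

The selected identifier is DOID_0050434, the Andersen-Tawil syndrome. With more entities in the sentence, the values could be combined (for example averaged) before taking the maximum, which is a design choice left open here.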


DiShin Installation 


To automate this process we can also execute
DiShIn as a command line tool¹⁴; however, we may
need to install python (or python3) and SQLite¹⁵.
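Before installing anything, a quick check such as the following sketch can report which of the tools used in this section are already available. The list of tool names is our own choice and may need adjusting, for example python instead of python3.

```shell
# Report whether each command can be found in the PATH
report=$(
  for tool in git curl unzip python3 sqlite3; do
    if command -v "$tool" > /dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
)
echo "$report"
```

Any tool reported as missing would have to be installed with the package manager of the system before continuing.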

First, we need to install it locally using the
git command line:

$ git clone git://github.com/lasigeBioTM/DiShIn.git

The git command automatically retrieves a tool
from the GitHub¹⁶ software repository.

¹⁴ https://github.com/lasigeBioTM/DiShIn
¹⁵ apt install python sqlite3 or apt install python3 sqlite3
¹⁶ https://en.wikipedia.org/wiki/GitHub


DiShIn web form: Ontology DO - Human Disease Ontology; Entry 1 DOID:0110644; Entry 2 DOID:0110034

Measure  MICA/DiShIn  (Ex|In)trinsic  Similarity
Resnik   DiShIn       intrinsic       0.0
Resnik   MICA         intrinsic       0.0
Lin      DiShIn       intrinsic       0.0
Lin      MICA         intrinsic       -0.0
JC       DiShIn       intrinsic       0.0593651994576
JC       MICA         intrinsic       0.0593651994576

Fig. 5.7 Semantic similarity between LQT1 (DOID:0110644) and X-linked Alport syndrome (DOID:0110034) using the online tool DiShIn


If everything works fine, we should be able to 
see something like this in our display: 


Cloning into 'DiShIn'...
Resolving deltas: 100% (255/255), done.

If the git command is not available, we can
alternatively download the compressed file (zip),
extract its contents and then move them to the DiShIn
folder:

$ curl -O -L https://github.com/lasigeBioTM/DiShIn/archive/master.zip

$ unzip master.zip

$ mv DiShIn-master DiShIn

The option -L enables the curl command to
follow a URL redirection¹⁷. The equivalent long
form of the -L option is --location.

We now have to copy the Human Disease
Ontology into the folder using the cp command,
and then enter the DiShIn folder:

$ cp doid.owl DiShIn/
$ cd DiShIn

Database File 


To execute DiShIn, we first need to convert the
ontology file named doid.owl into a database
(SQLite) file named doid.db:

$ python dishin.py doid.owl doid.db http://purl.obolibrary.org/obo/ http://www.w3.org/2000/01/rdf-schema#subClassOf ''

¹⁷ https://en.wikipedia.org/wiki/URL_redirection

If the module rdflib is not installed, the following
error will be displayed:

ImportError: No module named rdflib

We can try to install it¹⁸, but the conversion will
still take a few minutes to run.
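Whether rdflib is available can be checked beforehand with a one-line import test. The sketch below assumes the interpreter is called python3 (python may be the right name on other systems), and pip install rdflib is mentioned only as one common way to install the module.

```shell
# python3 is assumed here; use python instead if that is the installed interpreter
if python3 -c 'import rdflib' > /dev/null 2>&1; then
  rdflib_status="installed"
else
  rdflib_status="missing"   # one common way to install it: pip install rdflib
fi
echo "rdflib: $rdflib_status"
```

If the status is missing and installing the module is not an option, the ready-made database file remains a practical route.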

Alternatively, we can download the latest 
database version: 


$ curl -O http://labs.rd.ciencias.ulisboa.pt/book/doid.db


DiShIn Execution 


After being installed, we can execute DiShIn by
providing the database and two class identifiers:

$ python dishin.py doid.db DOID_0110644 DOID_0050434

$ python dishin.py doid.db DOID_0110644 DOID_0110034

The output of the first command will be
the semantic similarity values between LQT1
(DOID:0110644) and Andersen-Tawil syndrome
(DOID:0050434):

Resnik DiShIn intrinsic 3.1715006566
Resnik MICA intrinsic 6.34300131319
Lin DiShIn intrinsic 0.376553538118
Lin MICA intrinsic 0.753107076235
JC DiShIn intrinsic 0.0952210062728
JC MICA intrinsic 0.240449173481

¹⁸ https://github.com/RDFLib/rdflib

The output of the second command will be
the semantic similarity values between LQT1
(DOID:0110644) and X-linked Alport syndrome
(DOID:0110034):

Resnik DiShIn intrinsic 0.0
Resnik MICA intrinsic 0.0
Lin DiShIn intrinsic 0.0
Lin MICA intrinsic -0.0
JC DiShIn intrinsic 0.0593651994576
JC MICA intrinsic 0.0593651994576

In the end, we should not forget to return to 
our parent folder: 


$ cd ..


Learning python¹⁹ and SQL²⁰ is out of the scope
of this book, but if we do not intend to make
any modifications the above steps should be quite
simple to execute.


Large Lexicons 

The online tool MER is based on a shell script²¹,
so it can be easily executed as a command line tool to
efficiently recognize and link entities using large
lexicons.


MER Installation 


First, we need to install it locally using the git 
command line: 


$ git clone git://github.com/lasigeBioTM/MER.git

If everything works fine, we should be able to see
something like this in our display:

Cloning into 'MER'...
Resolving deltas: 100% (604/604), done.

¹⁹ https://www.w3schools.com/python/
²⁰ https://www.w3schools.com/sql/
²¹ https://github.com/lasigeBioTM/MER

If the git command is not available, we can
alternatively download the compressed file (zip),
and extract its contents:

$ curl -O -L https://github.com/lasigeBioTM/MER/archive/master.zip

$ unzip master.zip

$ mv MER-master MER

We now have to copy the Human Disease
Ontology into the data folder of MER, and then
enter the MER folder:

$ cp doid.owl MER/data/
$ cd MER


Lexicon Files 


To execute MER, we need first to create the 
lexicon files: 


$ (cd data; ../produce_data_files.sh doid.owl)

This may take a few minutes to run. However, we
only need to execute it once for each new version
of the ontology that we want to use. If we wait,
the output will include the last patterns of each of
the lexicon files.
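MER organizes the lexicon in several files according to the number of words of each term. The following sketch is only an illustration of that idea with awk, not the actual produce_data_files.sh script. The output file names and the three sample labels are chosen for the example, and the file with the first two words of longer terms reflects our reading of what such a lexicon may contain.

```shell
tmp=$(mktemp -d)
printf 'zoophobia\nzellweger syndrome\nzika virus disease\n' > "$tmp/labels.txt"

# NF is the number of whitespace-separated words in the current label
awk -v dir="$tmp" '
  NF == 1 { print > (dir "/word1.txt") }
  NF == 2 { print > (dir "/word2.txt") }
  NF >= 3 { print > (dir "/words.txt"); print $1, $2 > (dir "/words2.txt") }
' "$tmp/labels.txt"

cat "$tmp/words2.txt"
```

Keeping the first two words of long terms in a separate file allows a quick first pass over the text before testing the full multi-word terms.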

Alternatively, we can download the lexicon 
files, and extract them into the data folder: 


$ curl -O http://labs.rd.ciencias.ulisboa.pt/book/doid_lexicons.zip

$ unzip doid_lexicons.zip -d data/


We can check the contents of the created 
lexicons by using the tail command: 


$ tail data/doid*

These patterns are created according to the number
of words of each term.

The output should be something like this:

==> data/doid_links.tsv <==
zika virus disease	http://purl.obolibrary.org/obo/DOID_0060478
zikv congenital infection	http://purl.obolibrary.org/obo/DOID_0080180
zinacef allergy	http://purl.obolibrary.org/obo/DOID_0040025
zinsser-cole-engman syndrome	http://purl.obolibrary.org/obo/DOID_0070025
ziziphus mauritiana fruit allergy	http://purl.obolibrary.org/obo/DOID_0060507
zlotogora-zilberman-tenenbaum syndrome	http://purl.obolibrary.org/obo/DOID_0060773
zollinger-ellison syndrome	http://purl.obolibrary.org/obo/DOID_0050782
zoophilia	http://purl.obolibrary.org/obo/DOID_9336
zoophobia	http://purl.obolibrary.org/obo/DOID_600
zygomycosis	http://purl.obolibrary.org/obo/DOID_8485

==> data/doid.txt <==
zika virus disease
zikv congenital infection
zinacef allergy
zinsser-cole-engman syndrome
ziziphus mauritiana fruit allergy
zlotogora-zilberman-tenenbaum syndrome
zollinger-ellison syndrome
zoophilia
zoophobia
zygomycosis

==> data/doid_word1.txt <==
xph
xpid
XPV
xscid
yaba
yaws
zaspopathy
zoophilia
zoophobia
zygomycosis

==> data/doid_word2.txt <==
yunis.varon syndrome
zantac allergy
zebrafish allergy
zellweger syndrome
zemuron allergy
Zika fever
zinacef allergy
zinsser.cole.engman syndrome
zlotogora.zilberman.tenenbaum syndrome
zollinger.ellison syndrome

==> data/doid_words2.txt <==
yersinia infectious
yersinia pestis
yersinia pseudotuberculosis
y.linked monogenic
y.linked sertoli
y.linked spermatogenic
yolk sac
zika virus
zikv congenital
ziziphus mauritiana

==> data/doid_words.txt <==
y.linked spermatogenic failure 1
y.linked spermatogenic failure 2
yolk sac neoplasm
yolk sac tumor
yolk sac tumor of mediastinum
yolk sac tumor of the cns
zika virus congenital syndrome
zika virus disease
zikv congenital infection
ziziphus mauritiana fruit allergy


MER Execution


Now we are ready to execute MER, by providing
each sentence from the file chebi_27732_sentences.txt
as argument to its get_entities.sh script.


$ cat ../chebi_27732_sentences.txt | tr -d "'" | xargs -I {} ./get_entities.sh '{}' doid

We removed single quotes from the text, since
they are special characters for the xargs
command. We should note that this is the
get_entities.sh script inside the MER folder, not
the one we created before.
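If removing the single quotes is undesirable, for instance because it shifts the character positions of later matches, a while read loop is a possible alternative to xargs. The sketch below uses a made-up two-line file and a hypothetical stand-in function show_arg in place of ./get_entities.sh, so that it can run anywhere.

```shell
tmp=$(mktemp -d)
printf "Alzheimer's disease\nmalignant hyperthermia\n" > "$tmp/sentences.txt"

# Hypothetical stand-in for ./get_entities.sh: prints the argument it received
show_arg() { printf '[%s]\n' "$1"; }

# IFS= and -r keep leading spaces and backslashes; quotes need no special care
passed=$(while IFS= read -r sentence; do
  show_arg "$sentence"
done < "$tmp/sentences.txt")
echo "$passed"
```

Inside the MER folder, the call to show_arg would be replaced by ./get_entities.sh "$sentence" doid.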

Now we will be able to obtain a large number 
of matches: 


89	111	malignant hyperthermia	http://purl.obolibrary.org/obo/DOID_8545
74	96	malignant hyperthermia	http://purl.obolibrary.org/obo/DOID_8545
157	164	disease	http://purl.obolibrary.org/obo/DOID_4
144	164	central core disease	http://purl.obolibrary.org/obo/DOID_3529
13	20	disease	http://purl.obolibrary.org/obo/DOID_4
47	55	myopathy	http://purl.obolibrary.org/obo/DOID_423


The first two numbers represent the start and end 
position of the match in the sentence. They are 
followed by the name of the disease and its URI 
in the ontology. 
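As a small sanity check of this format, the sketch below stores the six matches shown above in a temporary file and verifies with awk that the difference between the end and start positions equals the length of the matched label. It also counts the distinct URIs. The temporary file name matches.tsv is arbitrary.

```shell
tmp=$(mktemp -d)
base='http://purl.obolibrary.org/obo'
{
  printf '89\t111\tmalignant hyperthermia\t%s/DOID_8545\n' "$base"
  printf '74\t96\tmalignant hyperthermia\t%s/DOID_8545\n' "$base"
  printf '157\t164\tdisease\t%s/DOID_4\n' "$base"
  printf '144\t164\tcentral core disease\t%s/DOID_3529\n' "$base"
  printf '13\t20\tdisease\t%s/DOID_4\n' "$base"
  printf '47\t55\tmyopathy\t%s/DOID_423\n' "$base"
} > "$tmp/matches.tsv"

# Rows where end - start equals the number of characters of the label
consistent=$(awk -F '\t' '$2 - $1 == length($3)' "$tmp/matches.tsv" | wc -l | tr -d ' ')

# Number of different classes that were linked
distinct=$(cut -f4 "$tmp/matches.tsv" | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "$consistent consistent rows, $distinct distinct URIs"
```

All six rows pass the check, and they link to four different classes of the ontology.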

We can also redirect the output to a TSV file
named diseases_recognized.tsv:

$ cat ../chebi_27732_sentences.txt | tr -d "'" | xargs -I {} ./get_entities.sh '{}' doid > ../diseases_recognized.tsv

   A    B    C                       D
1  89   111  malignant hyperthermia  http://purl.obolibrary.org/obo/DOID_8545
2  74   96   malignant hyperthermia  http://purl.obolibrary.org/obo/DOID_8545
3  157  164  disease                 http://purl.obolibrary.org/obo/DOID_4
4  144  164  central core disease    http://purl.obolibrary.org/obo/DOID_3529
5  13   20   disease                 http://purl.obolibrary.org/obo/DOID_4
6  47   55   myopathy                http://purl.obolibrary.org/obo/DOID_423

Fig. 5.8 The diseases_recognized.tsv file opened in a spreadsheet application

We can now open the file in our spreadsheet
application, such as LibreOffice Calc or
Microsoft Excel (see Fig. 5.8).

Again, we should not forget to return to our
parent folder in the end:

$ cd ..


Further Reading 


To know more about biomedical ontologies, 
the book entitled Introduction to bio-ontologies 
is an excellent option, covering most of 
the ontologies and computational techniques 
exploring them (Robinson and Bauer 2011). 

Another approach is to read and watch the
materials of the training course given by Barry
Smith²².

²² http://ontology.buffalo.edu/smith/IntroOntology_Course.html